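Releasing GPU resources after a training program crashes on a server

When a training job crashes on a shared server, the GPU memory it was using is sometimes not released. This post walks through how to free it. The main constraint on a multi-GPU machine is that the cleanup must not touch GPUs that other people are currently using.

1. Check GPU usage

Check the load on each GPU and the processes running on it:

    watch nvidia-smi

    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 580.126.09        Driver Version: 580.126.09        CUDA Version: 13.0       |
    +-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA A100-SXM4-80GB          Off |   00000000:00:09.0 Off |                    0 |
    | N/A   39C    P0              89W / 400W |    24375MiB / 81920MiB |      0%      Default |
    |                                         |                        |             Disabled |
    +-----------------------------------------+------------------------+----------------------+
    |   1  NVIDIA A100-SXM4-80GB          Off |   00000000:00:0A.0 Off |                    0 |
    | N/A   35C    P0              95W / 400W |    45173MiB / 81920MiB |      0%      Default |
    |                                         |                        |             Disabled |
    +-----------------------------------------+------------------------+----------------------+
    |   2  NVIDIA A100-SXM4-80GB          Off |   00000000:00:0B.0 Off |                    0 |
    | N/A   39C    P0              93W / 400W |    77961MiB / 81920MiB |     56%      Default |
    |                                         |                        |             Disabled |
    +-----------------------------------------+------------------------+----------------------+
    |   3  NVIDIA A100-SXM4-80GB          Off |   00000000:00:0C.0 Off |                    0 |
    | N/A   37C    P0              99W / 400W |    78351MiB / 81920MiB |    100%      Default |
    |                                         |                        |             Disabled |
    +-----------------------------------------+------------------------+----------------------+

    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |    2   N/A  N/A   3342634      C   /bin/python3                                77918MiB |
    |    3   N/A  N/A   3342635      C   /bin/python3                                78308MiB |
    +-----------------------------------------------------------------------------------------+

The crashed program was training on GPUs 0 and 1. It is no longer running, but the memory on GPUs 0 and 1 has not been released: they still show 24375 MiB and 45173 MiB in use at 0% utilization, and the process table lists only the jobs on GPUs 2 and 3.

2. Find the processes holding the GPU devices

List every process that has the NVIDIA device files open:

    sudo fuser -v /dev/nvidia*

                         USER        PID ACCESS COMMAND
    /dev/nvidia0:        root    3341936 F...m pt_main_thread
                         root    3342634 F.... pt_main_thread
                         root    3342635 F.... pt_main_thread
                         root    3348049 F...m python3
                         root    3348112 F...m python3
    /dev/nvidia1:        root    3341936 F...m pt_main_thread
                         root    3342634 F.... pt_main_thread
                         root    3342635 F.... pt_main_thread
                         root    3348049 F...m python3
                         root    3348112 F...m python3
    /dev/nvidia2:        root    3341936 F...m pt_main_thread
                         root    3342634 F...m pt_main_thread
                         root    3342635 F...m pt_main_thread
                         root    3348049 F.... python3
                         root    3348112 F.... python3
    /dev/nvidia3:        root    3341936 F...m pt_main_thread
                         root    3342634 F...m pt_main_thread
                         root    3342635 F...m pt_main_thread
                         root    3348049 F.... python3
                         root    3348112 F.... python3

In the ACCESS column, F marks an open file and m a memory-mapped file. PIDs 3348049 and 3348112 have /dev/nvidia0 and /dev/nvidia1 mapped but not /dev/nvidia2 or /dev/nvidia3, and neither appears in the nvidia-smi process table, so they are the candidates to inspect. PIDs 3342634 and 3342635 are the jobs on GPUs 2 and 3 and must be left alone.

3. Inspect the candidate processes

    ps -p 3348049,3348112 -o pid,ppid,stat,cmd

        PID    PPID STAT CMD
    3348049 2589037 Sl   /opt/bin/python3 -u tools/train.py --local-rank 0 projects/configs/stage.py --launcher pytorch --deterministic --work-dir ./work_dirs/stage
    3348112 2589037 Sl   /opt/bin/python3 -u tools/train.py --local-rank 0 projects/configs/stage.py --launcher pytorch --deterministic --work-dir ./work_dirs/stage

Both PIDs belong to the training program that crashed earlier.

4. Kill the leftover processes

    sudo kill -9 3348049 3348112

Running step 1 again now shows that the GPU memory has been released.

5. Stop the previous container and re-enter it

    docker stop e847dc3213cf
    docker start e847dc3213cf
    docker exec -it e847dc3213cf bash

6. Multi-GPU communication timeout errors

If multi-GPU communication fails with a timeout error, the setting used here is:

    export NCCL_TIMEOUT=36000000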
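The manual comparison in steps 1 to 3 (PIDs that hold /dev/nvidia* but that nvidia-smi does not list as compute processes) can be sketched as a small bash helper. This is only an illustration, not part of the original procedure: the function names `orphan_candidates`, `device_holders`, and `visible_compute_pids` are made up for this example, and it assumes `fuser` (psmisc) prints bare PIDs on stdout and that `nvidia-smi` is run in the same PID namespace as the processes (inside a container the PIDs may not match). It only prints candidates and never kills anything.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names are illustrative, not from the original post.

# PIDs that have any NVIDIA device file open.
# fuser writes PIDs to stdout and file names/access letters to stderr.
device_holders() {
    sudo fuser /dev/nvidia* 2>/dev/null
}

# PIDs that nvidia-smi currently reports as compute processes, one per line.
visible_compute_pids() {
    nvidia-smi --query-compute-apps=pid --format=csv,noheader
}

# Print every PID from list $1 that is absent from list $2,
# de-duplicated and sorted numerically, one per line.
orphan_candidates() {
    local holders="$1" visible="$2" pid
    for pid in $holders; do
        # $(echo $visible) collapses newlines/extra blanks into single spaces
        case " $(echo $visible) " in
            *" $pid "*) ;;          # visible to nvidia-smi: skip
            *) echo "$pid" ;;       # holds a device but is not listed
        esac
    done | sort -un
}

# Example (run manually on the GPU server):
#   orphan_candidates "$(device_holders)" "$(visible_compute_pids)"
```

With the values shown in this post, the helper would print 3341936, 3348049, and 3348112. Not every candidate is a leftover from the crash, so each PID should still be checked with `ps` as in step 3 before anything is killed.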