Catch cuda driver shutdown error in NCCLWatchdog (#106503)
There is a design flaw in NCCLWatchdog: it spawns threads that call into
the CUDA API, but by the time those calls run, the CUDA runtime may
already have been deinitialized, creating a shutdown race.
This is a known issue with widespread impact
(https://github.com/pytorch/pytorch/issues/90848).
I tested this fix against the repro command for https://github.com/pytorch/pytorch/issues/82632 by running `NCCL_DESYNC_DEBUG=1 CUDA_LAUNCH_BLOCKING=1 python test/distributed/test_c10d_nccl.py -k test_find_unused_parameters_kwarg_debug_detail`. Instead of crashing, the process now emits log messages containing the exception string for the CUDA driver shutdown error.
A partial fix was landed already, but it applied too narrowly:
https://github.com/pytorch/pytorch/commit/ec071a0815adfd180ba1ab103be2f7d2227f07cc
This PR copies the previous fix to one more call site, plugging another
hole. We probably need a more thorough review to either plug all the
remaining holes or redesign this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106503
Approved by: https://github.com/kwen2501