[PyTorch/NCCL] Fix async error handling (#45456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45456
Remove completed work objects without holding the lock, to avoid deadlocking with the watchdog thread while the GPU is at 100% utilization.
SyncBatchNorm failure trace: P143879560
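A minimal C++ sketch of the pattern this fix follows (not the actual ProcessGroupNCCL code; `WorkNCCL`, `WorkQueue`, and `cleanupCompletedWork` are hypothetical stand-ins): completed work is spliced off the shared list while holding the lock, but its potentially blocking destruction runs only after the lock is released, so the watchdog thread is never stuck waiting on that cleanup.

```cpp
#include <list>
#include <memory>
#include <mutex>

// Hypothetical stand-in for the real work class.
struct WorkNCCL {
  bool isCompleted() const { return completed_; }
  bool completed_ = false;
  // Destruction may synchronize with the GPU (e.g. waiting on CUDA events),
  // which can be slow when the device is at 100% utilization.
  ~WorkNCCL() { /* e.g. wait on recorded CUDA events */ }
};

class WorkQueue {
 public:
  void cleanupCompletedWork() {
    std::list<std::shared_ptr<WorkNCCL>> doneWork;
    {
      std::lock_guard<std::mutex> lock(workListMutex_);
      for (auto it = workList_.begin(); it != workList_.end();) {
        if ((*it)->isCompleted()) {
          // Keep the work alive past the critical section; erase it from
          // the shared list while we still hold the lock.
          doneWork.push_back(std::move(*it));
          it = workList_.erase(it);
        } else {
          ++it;
        }
      }
    }  // lock released here
    // Potentially slow destruction happens outside the critical section,
    // so the watchdog thread can still acquire workListMutex_.
    doneWork.clear();
  }

 private:
  std::mutex workListMutex_;
  std::list<std::shared_ptr<WorkNCCL>> workList_;
};
```

The key design point is that only the list manipulation happens under `workListMutex_`; anything that can block on the GPU is deferred until after the lock is dropped.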
Test Plan:
**Desync test:**
BACKEND=nccl WORLD_SIZE=3 NCCL_ASYNC_ERROR_HANDLING=1 ./buck-out/gen/caffe2/test/distributed/distributed_nccl_spawn#binary.par -r test_DistributedDataParallel_desync
**SyncBatchNorm test:**
BACKEND=nccl WORLD_SIZE=3 NCCL_ASYNC_ERROR_HANDLING=1 ./buck-out/gen/caffe2/test/distributed/distributed_nccl_fork#binary.par -r test_DistributedDataParallel_SyncBatchNorm_Diff_Input_Sizes_gradient
Reviewed By: osalpekar
Differential Revision: D23972071
fbshipit-source-id: f03d9637a6ec998d64dab1a062a81e0f3697275f