a80dd02a - [Resubmit] Ensure NCCL_BLOCKING_WAIT=1 works for dist.barrier() (#40249)

Commit
4 years ago
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40249

Blocking wait didn't work for dist.barrier(), since we performed a cudaDeviceSynchronize() before any of the timeout checks. As a result, in case of failures or desync, the barrier() call would get stuck in cudaDeviceSynchronize() and would never return a timeout error to the user. To fix this, I've moved the device synchronization after the timeout checks.

ghstack-source-id: 106250153

Test Plan: waitforbuildbot

Differential Revision: D22126152

fbshipit-source-id: d919a7a6507cca7111d8ad72e916777b986d0d67
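The ordering fix described above can be sketched as follows. This is a minimal, hypothetical illustration, not the actual ProcessGroupNCCL code: `blockingWait`, `workCompleted`, and the stubbed `cudaDeviceSynchronize` are stand-ins invented for this example. The key point is that the timeout check runs before the device synchronization, so a desynced barrier surfaces a timeout error instead of hanging forever inside the synchronize call.

```cpp
#include <chrono>
#include <stdexcept>
#include <thread>

// Hypothetical stand-ins for the real CUDA/NCCL state and calls.
bool workCompleted = false;            // would be set when the NCCL op finishes
void cudaDeviceSynchronize() { /* stub: the real call blocks on all GPU work */ }

// Sketch of the fixed blocking-wait logic: poll for completion with a
// timeout FIRST, and only call cudaDeviceSynchronize() once the work is
// known to be done. With the old ordering (synchronize first), a rank
// stuck in a desynced barrier() could never reach the timeout check.
void blockingWait(std::chrono::milliseconds timeout) {
    auto start = std::chrono::steady_clock::now();
    while (!workCompleted) {
        if (std::chrono::steady_clock::now() - start >= timeout) {
            throw std::runtime_error("Timed out waiting for collective");
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    // Device synchronization happens only after the timeout checks pass.
    cudaDeviceSynchronize();
}
```

With `NCCL_BLOCKING_WAIT=1`, this ordering is what lets `dist.barrier()` raise a timeout to the caller instead of blocking indefinitely when ranks desynchronize.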