pytorch
d1a0e880 - Ensure NCCL_BLOCKING_WAIT=1 works for dist.barrier() (#40207)

Committed 4 years ago
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40207

Blocking wait didn't work for dist.barrier() because we performed a cudaDeviceSynchronize() before any of the timeout checks. As a result, in case of a failure or desync, the barrier() call would get stuck on cudaDeviceSynchronize() and would never return a timeout error to the user. To fix this, I've moved the device synchronization after the timeout checks.

ghstack-source-id: 106123004
Test Plan: waitforbuildbot
Differential Revision: D22108899
fbshipit-source-id: 6b109ef9357e9464e7d66b540caabf5801e6a44a
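The ordering fix described above can be sketched in plain Python. This is a hypothetical illustration, not the actual ProcessGroupNCCL C++ implementation: the names `wait_for_work`, `is_completed`, and `device_synchronize` are assumptions standing in for the real internals. The key point is that the timeout check runs in a polling loop *before* the blocking synchronization, so a desynced barrier raises a timeout instead of hanging forever.

```python
import time


class WorkTimeoutError(RuntimeError):
    """Raised when a collective does not complete within the timeout."""


def wait_for_work(is_completed, device_synchronize,
                  timeout_s, poll_interval_s=0.01):
    """Hypothetical blocking-wait sketch mirroring the fix: poll for
    completion with a deadline, and only perform the blocking device
    synchronization (stand-in for cudaDeviceSynchronize()) once the
    work is known to be complete."""
    deadline = time.monotonic() + timeout_s
    while not is_completed():
        # Timeout check happens BEFORE any blocking sync, so on a
        # failure/desync we can surface an error to the caller.
        if time.monotonic() >= deadline:
            raise WorkTimeoutError("collective operation timed out")
        time.sleep(poll_interval_s)
    # Safe to block now: the work has finished.
    device_synchronize()
```

With the pre-fix ordering (synchronize first, then check timeouts), a rank stuck waiting on peers would block inside the synchronize call and the timeout logic would never run; moving the check first restores the NCCL_BLOCKING_WAIT=1 timeout behavior for dist.barrier().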