[Resubmit] Ensure NCCL_BLOCKING_WAIT=1 works for dist.barrier() (#40249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40249
Blocking wait didn't work for dist.barrier() because we called
cudaDeviceSynchronize() before running any of the timeout checks. As a
result, on a failure or desync the barrier() call would hang in
cudaDeviceSynchronize() and never return a timeout error to the user.
To fix this, I've moved the device synchronization after the timeout checks.
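
The reordering can be illustrated with a minimal sketch in plain Python. This
is not the ProcessGroupNCCL code: FakeWork, blocking_wait and the device_sync
callback are hypothetical names made up for illustration, and the device
synchronization is modeled as an ordinary callable. The point is only the
ordering: poll for completion against a deadline first, and run the blocking
device synchronization only once the work is known to have finished.

```python
import time


class FakeWork:
    """Hypothetical stand-in for a pending collective (not a PyTorch class)."""

    def __init__(self, completes_after_s):
        # None models a desynced/failed collective that never completes.
        if completes_after_s is None:
            self._done_at = None
        else:
            self._done_at = time.monotonic() + completes_after_s

    def is_completed(self):
        return self._done_at is not None and time.monotonic() >= self._done_at


def blocking_wait(work, timeout_s, device_sync, poll_interval_s=0.001):
    """Run the timeout checks first, then the blocking device sync.

    device_sync is an assumed callable standing in for the device
    synchronization; it is only reached after the work has completed.
    """
    deadline = time.monotonic() + timeout_s
    while not work.is_completed():
        if time.monotonic() >= deadline:
            # The caller gets an error instead of hanging in the sync.
            raise TimeoutError("operation timed out")
        time.sleep(poll_interval_s)
    device_sync()
```

With the sync placed before the polling loop, a collective that never
completes would block in device_sync() and the TimeoutError could never be
raised, which is the behavior described above.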
ghstack-source-id: 106250153
Test Plan: waitforbuildbot
Differential Revision: D22126152
fbshipit-source-id: d919a7a6507cca7111d8ad72e916777b986d0d67