Ensure NCCL_BLOCKING_WAIT=1 works for dist.barrier() (#40207)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40207
Blocking wait didn't work for dist.barrier() because we called
cudaDeviceSynchronize() before performing any of the timeout checks. As a
result, on failure or desync the barrier() call would hang in
cudaDeviceSynchronize() and never return a timeout error to the user.
To fix this, I've moved the device synchronization after the timeout checks.
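
For illustration only, here is a minimal Python sketch of the ordering this
change aims for. It is not the actual ProcessGroupNCCL code: FakeWork,
wait_blocking, and device_synchronize are hypothetical names, and the
"device sync" is simulated rather than a real CUDA call. The point is that the
timeout poll runs first, and the (potentially hanging) synchronization only
runs once the work is known to have completed.

```python
import time


class FakeWork:
    """Hypothetical stand-in for a collective's work handle (not a PyTorch class)."""

    def __init__(self, completes_after=None):
        # completes_after=None models a failed/desynced collective that never finishes.
        self._start = time.monotonic()
        self._completes_after = completes_after
        self.device_synced = False

    def is_completed(self):
        if self._completes_after is None:
            return False
        return time.monotonic() - self._start >= self._completes_after

    def device_synchronize(self):
        # Simulated device sync. In this model, calling it on unfinished work
        # represents the hang described above, so we flag it loudly instead.
        if not self.is_completed():
            raise AssertionError("device sync on unfinished work would hang")
        self.device_synced = True


def wait_blocking(work, timeout_s, poll_s=0.001):
    """Poll for completion with a timeout, then synchronize the device."""
    deadline = time.monotonic() + timeout_s
    # Timeout checks come first, so a hung collective surfaces as an error.
    while not work.is_completed():
        if time.monotonic() >= deadline:
            raise TimeoutError("barrier timed out after %.3fs" % timeout_s)
        time.sleep(poll_s)
    # Only reached once the work has completed, so this cannot hang in the model.
    work.device_synchronize()
```

With this ordering, a FakeWork that never completes makes wait_blocking raise
TimeoutError instead of blocking, while a healthy one is synchronized as before.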
ghstack-source-id: 106123004
Test Plan: waitforbuildbot
Differential Revision: D22108899
fbshipit-source-id: 6b109ef9357e9464e7d66b540caabf5801e6a44a