Raise proper timeout when sharing the distributed shared seed (#81666) (#81892)
Summary:
Fixes https://github.com/pytorch/data/issues/659
- Fixes the `TimeoutError` raised when the DataLoader on rank 0 is slow, by removing the `wait` operation on the other ranks.
- Adds a [default timeout](https://github.com/pytorch/pytorch/blob/f6a45f79841fb7cdc4dfa294dbdd66d7e4b75c18/torch/csrc/distributed/c10d/ProcessGroup.hpp#L26-L27) of 30 * 60 seconds (30 minutes), following the distributed team's implementation. If sharing the distributed seed gets stuck on any rank, a timeout error with a detailed message is raised.
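
For illustration only, the timeout behavior described above can be sketched roughly as follows. This is not the code from this PR: `wait_for_shared_seed`, the dict-backed `store`, the `key` name, and the polling loop are all hypothetical stand-ins; only the 30 * 60 second default comes from the summary.

```python
import time
from datetime import timedelta

# Mirrors the 30 * 60 second default mentioned in the summary.
DEFAULT_TIMEOUT = timedelta(seconds=30 * 60)


def wait_for_shared_seed(store, key, rank, timeout=DEFAULT_TIMEOUT, poll_interval=0.01):
    """Hypothetical helper: poll `store` until `key` is published or time runs out.

    `store` is assumed to be a plain dict standing in for whatever
    key-value store the ranks actually share.
    """
    deadline = time.monotonic() + timeout.total_seconds()
    while True:
        if key in store:
            return store[key]
        if time.monotonic() >= deadline:
            # Name the rank and key so a stuck rank is easy to identify.
            raise RuntimeError(
                f"Timed out after {timeout.total_seconds()} seconds waiting for "
                f"the shared seed (key={key!r}) on rank {rank}."
            )
        time.sleep(poll_interval)
```

With a bounded wait like this, a rank that never receives the seed fails with a message naming the rank instead of hanging indefinitely.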
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81666
Approved by: https://github.com/NivekT
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/aa1466d542c6addd5719f268f57ccdcbc0dbf84f
Reviewed By: jeanschmidt
Differential Revision: D37990752
Pulled By: ejguan
fbshipit-source-id: 41639341aa737ab64de1992db5ed43cbb110ec91
Co-authored-by: erjia (Meta Employee) <erjia@fb.com>