pytorch
d5204064 - [BE] Fix flaky ProcessGroupGloo tests (#61396)

Summary: A hypothesis for why tests such as https://github.com/pytorch/pytorch/issues/57469 may be flaky is that `c10d = ProcessGroupGloo(...)` is not actually guaranteed to be a synchronization point. Some ranks may create the PG, run all the error checking (which does not actually call into gloo APIs and therefore requires no synchronization), and then exit, all before other ranks have created the gloo pg. This can result in the following error:

```
File "distributed/test_c10d_gloo.py", line 1037, in test_reduce_checks
May 03 06:42:34     pg = c10d.ProcessGroupGloo(store, self.rank, self.world_size, self.opts())
May 03 06:42:34 RuntimeError: [/var/lib/jenkins/workspace/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [127.0.0.1]:35521
```

which indicates that the remote end has hung up. Furthermore, all the flaky tests in this file only do error checking and never call into the gloo APIs, further suggesting that this is the root cause. Not 100% sure this PR will fix it, since I haven't been able to actually repro the issue even after 10000+ runs, but it happens regularly in CI.

To fix this, we add a `dist.barrier(group=pg)` call after creating the pg to enforce a synchronization. Would be good to land this and observe whether it helps with the flakiness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61396

Reviewed By: mrshenli

Differential Revision: D29664189

Pulled By: rohan-varma

fbshipit-source-id: bc046d5d816fe6cb426522b85312383bfa3f90b7
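The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual change to `test_c10d_gloo.py`; the `create_gloo_pg` helper, the TCPStore address and port, and constructing `ProcessGroupGloo` without explicit options are assumptions made for the example.

```python
# Minimal sketch of the fix described in the commit message (hypothetical
# helper; the address, port, and option-less constructor are assumptions).
import torch.distributed as dist
from torch.distributed import ProcessGroupGloo, TCPStore

def create_gloo_pg(rank: int, world_size: int) -> ProcessGroupGloo:
    # One TCPStore server (rank 0) that all ranks connect to for rendezvous.
    store = TCPStore("127.0.0.1", 29500, world_size, is_master=(rank == 0))
    pg = ProcessGroupGloo(store, rank, world_size)
    # Constructing the PG is not guaranteed to synchronize the ranks, so force
    # every rank to reach this point before any rank proceeds to a test body
    # that never calls into gloo and then exits, closing its connections.
    dist.barrier(group=pg)
    return pg
```

The barrier is the first collective that requires all ranks to participate, so a fast rank that only runs argument-validation tests cannot tear down its connections before slower ranks have finished joining the group.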