pytorch
15eed5b7 - [Oncall][MTPG] Fix flaky test multi_threaded - test_broadcast_object_list (#103568)

Commit
1 year ago
[Oncall][MTPG] Fix flaky test multi_threaded - test_broadcast_object_list (#103568)

This test (https://github.com/pytorch/pytorch/blob/8340762211e3b55caa178bac748bd902249f6fc0/test/distributed/test_multi_threaded_pg.py#L133) is failing on the internal sandbox with the following error message:

```
File "/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/buck-out/v2/gen/fbcode/8c7462494077df89/caffe2/test/distributed/__multi_threaded__/multi_threaded#link-tree/torch/testing/_internal/distributed/multi_threaded_pg.py", line 255, in _start_coll
    raise Exception(
Exception: world not ready, only 3 PG's registered but world has 4 ranks
exiting thread 1
ERROR
```

Internal error report: https://www.internalfb.com/intern/test/562950031915334?ref_report_id=0

We believe this is because we no longer perform a barrier after init (see https://github.com/pytorch/pytorch/pull/99937). This PR temporarily turns `TORCH_DIST_INIT_BARRIER` back on to avoid the flaky test for the time being, but we should look into a proper way to handle this.

cc @kumpera @kwen2501

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103568
Approved by: https://github.com/H-Huang
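
For illustration only, here is a minimal sketch of the race described above and of why a barrier after init avoids it. It uses only the Python standard library and is not the code in `multi_threaded_pg.py`; `ToyWorld`, `run_ranks`, and the `RuntimeError` are hypothetical stand-ins, with one thread playing each rank.

```python
import threading

WORLD_SIZE = 4  # matches the 4-rank world in the error message above


class ToyWorld:
    """Hypothetical stand-in for a shared registry of per-rank process groups."""

    def __init__(self, world_size):
        self.world_size = world_size
        self.registered = set()
        self.lock = threading.Lock()

    def register(self, rank):
        # Plays the role of "init": each rank registers its process group.
        with self.lock:
            self.registered.add(rank)

    def start_collective(self):
        # A collective can only start once every rank has registered.
        with self.lock:
            count = len(self.registered)
        if count != self.world_size:
            raise RuntimeError(
                f"world not ready, only {count} PG's registered "
                f"but world has {self.world_size} ranks"
            )


def run_ranks(world_size=WORLD_SIZE):
    """Run one thread per rank and return the list of (rank, error) failures."""
    world = ToyWorld(world_size)
    init_barrier = threading.Barrier(world_size)
    errors = []

    def worker(rank):
        try:
            world.register(rank)
            # Barrier after init: no rank proceeds until all have registered.
            init_barrier.wait(timeout=10)
            world.start_collective()
        except Exception as exc:
            errors.append((rank, exc))

    threads = [
        threading.Thread(target=worker, args=(rank,))
        for rank in range(world_size)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors


if __name__ == "__main__":
    print(run_ranks())
```

Without the `init_barrier.wait(...)` call, a fast thread could reach `start_collective()` before a slower thread has registered, which is the kind of intermittent "world not ready" failure reported here.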