pytorch
79931889 - Fix distributed_test.py flakiness (#78797)

Commit
2 years ago
Fix distributed_test.py flakiness (#78797)

There are several recent distributed_test / DDP flaky tests: https://github.com/pytorch/pytorch/issues?q=is%3Aopen+is%3Aissue+label%3A%22oncall%3A+distributed%22+label%3A%22module%3A+flaky-tests%22+

From local experimentation, we frequently see segfaults such as the error in https://github.com/pytorch/pytorch/issues/78684 when running with NCCL. I switched the test to run with gloo, and these issues appeared to go away. I then switched back to nccl but turned off async_error_handling (some of the stack traces had ncclCommWatchdog + workCleanupLoop in them, so I thought it might be an issue / race between the two or the like). Turning off async_error_handling also seems to alleviate the flakiness.

If this indeed works, we should probably land this PR, as we are losing a lot of CI signal, and prioritize understanding why the async error handling / comm watchdog interaction might be causing these segfaults.

Closes https://github.com/pytorch/pytorch/issues/78768 https://github.com/pytorch/pytorch/issues/78767 https://github.com/pytorch/pytorch/issues/78748 https://github.com/pytorch/pytorch/issues/78685 https://github.com/pytorch/pytorch/issues/78684 https://github.com/pytorch/pytorch/issues/78641

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78797
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
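The two experiments described in the message (running with gloo, and running with nccl with async error handling turned off) can be sketched as an environment setup for the test process. This is a minimal illustration, not the actual diff: it assumes that the test reads its backend from a `BACKEND` environment variable and that async error handling is toggled through `NCCL_ASYNC_ERROR_HANDLING`. The helper `build_test_env` is hypothetical and does not exist in PyTorch.

```python
import os

# Assumption: this is the environment variable that toggles NCCL async
# error handling in the PyTorch version under test.
ASYNC_ERROR_HANDLING_VAR = "NCCL_ASYNC_ERROR_HANDLING"


def build_test_env(backend="nccl", async_error_handling=False, base_env=None):
    """Return a copy of the environment configured for one test run.

    Hypothetical helper: `backend` is "nccl" or "gloo"; the async error
    handling flag only applies to nccl and is dropped for other backends.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: the test suite selects its backend from BACKEND.
    env["BACKEND"] = backend
    if backend == "nccl":
        env[ASYNC_ERROR_HANDLING_VAR] = "1" if async_error_handling else "0"
    else:
        env.pop(ASYNC_ERROR_HANDLING_VAR, None)
    return env


# Experiment 1: switch the backend to gloo.
gloo_env = build_test_env(backend="gloo", base_env={})
# Experiment 2: keep nccl but turn async error handling off.
nccl_env = build_test_env(backend="nccl", async_error_handling=False, base_env={})
```

The returned dictionary could be passed as `env=` to `subprocess.run` when launching the test, which keeps the change local to one run instead of mutating the parent process environment.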