Fix distributed_test.py flakiness (#78797)
Several distributed_test / DDP tests have recently been flaky: https://github.com/pytorch/pytorch/issues?q=is%3Aopen+is%3Aissue+label%3A%22oncall%3A+distributed%22+label%3A%22module%3A+flaky-tests%22+
In local experimentation, segfaults such as the one in https://github.com/pytorch/pytorch/issues/78684 occur frequently when running with NCCL. I switched the test to run with gloo, and the segfaults appeared to go away.
I then switched back to nccl but turned off async_error_handling. Some of the stack traces contained ncclCommWatchdog and workCleanupLoop, so I suspected an issue, possibly a race, between the two. Turning off async_error_handling also seems to alleviate the failures. If this indeed works, we should probably land this PR, since we are losing a lot of CI signal, and then prioritize understanding why the async error handling / comm watchdog interaction might be causing these segfaults.
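For local reproduction, the experiment above can be sketched roughly as follows. This is a minimal sketch, not the change made in this PR: it assumes async error handling is controlled through the `NCCL_ASYNC_ERROR_HANDLING` environment variable and that the test picks its backend from a `BACKEND` environment variable. The helper name `nccl_env_without_async_error_handling` is hypothetical.

```python
import os


def nccl_env_without_async_error_handling(base_env=None):
    """Return a copy of the environment with NCCL selected and async error handling off.

    Hypothetical helper for illustration only. Assumes the test reads BACKEND
    and that NCCL_ASYNC_ERROR_HANDLING=0 disables async error handling.
    """
    # Copy so the caller's environment is not mutated.
    env = dict(os.environ if base_env is None else base_env)
    env["BACKEND"] = "nccl"
    env["NCCL_ASYNC_ERROR_HANDLING"] = "0"
    return env
```

The returned mapping could then be passed as `env=` to `subprocess.run` when launching the test, so the setting applies only to that run and not to the parent shell.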
Closes https://github.com/pytorch/pytorch/issues/78768 https://github.com/pytorch/pytorch/issues/78767 https://github.com/pytorch/pytorch/issues/78748 https://github.com/pytorch/pytorch/issues/78685 https://github.com/pytorch/pytorch/issues/78684 https://github.com/pytorch/pytorch/issues/78641
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78797
Approved by: https://github.com/wanchaol, https://github.com/fduwjj