Fix OSS flaky test_destroy_full_group on MPI backend in pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test environment by adding a barrier and retrying MPI_Comm_create 3 times (#55921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55921
Fix this flaky test by adding a barrier and retrying the flaky `MPI_Comm_create` call up to 3 times.
I couldn't figure out the root cause of why `createProcessGroupMPI` can be flaky when it merely creates a subgroup communicator, which mainly involves invoking `MPI_Comm_create`; `createProcessGroupMPI` does not perform any p2p or collective communication at all. I also could not dig further into `MPI_Comm_create`, since it lives in the MPI codebase.
I also checked the commit history, and there were no commits to `ProcessGroupMPI.cpp` in the few days before Mar 10th.
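For illustration, a minimal sketch of the retry-with-barrier pattern (hypothetical function and constant names, not the exact patch in `ProcessGroupMPI.cpp`; it assumes the communicator's error handler is set to `MPI_ERRORS_RETURN` so a failed call returns an error code instead of aborting):
```cpp
#include <mpi.h>
#include <stdexcept>

// Sketch: create a subgroup communicator, retrying MPI_Comm_create a few
// times with a barrier before each attempt so all ranks are synchronized.
MPI_Comm createSubgroupComm(MPI_Comm parentComm, MPI_Group group) {
  constexpr int kMaxRetries = 3;
  MPI_Comm newComm = MPI_COMM_NULL;
  for (int attempt = 0; attempt < kMaxRetries; ++attempt) {
    // Synchronize all ranks in the parent communicator before (re)trying.
    MPI_Barrier(parentComm);
    if (MPI_Comm_create(parentComm, group, &newComm) == MPI_SUCCESS) {
      return newComm;
    }
  }
  throw std::runtime_error("MPI_Comm_create failed after retries");
}
```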
First failure (on Mar 10th):
https://app.circleci.com/pipelines/github/pytorch/pytorch/283704/workflows/d84ac4a0-42e3-4925-b1cf-32d3c3d1022a/jobs/11456129
Note that the test failure cannot be reproduced locally.
Verified the fix on CI:
https://app.circleci.com/pipelines/github/pytorch/pytorch/300586/workflows/a5c16db4-3ae2-44c7-a9c8-b0885dad2a64/jobs/12356852
test_destroy_full_group was rerun 100 times and passed.
#Closes: https://github.com/pytorch/pytorch/issues/53899
ghstack-source-id: 126414937
Test Plan:
```
export BACKEND=mpi
export WORLD_SIZE=2
pytest -k test_destroy_full_group test/distributed/test_distributed_fork.py -vs
```
```
#!/bin/bash
for i in {1..100}; do
  pytest -k test_destroy_full_group test/distributed/test_distributed_fork.py
done
```
CI tests triggered from a new branch:
https://app.circleci.com/pipelines/github/pytorch/pytorch?branch=ci-all%2Fwayi_mpi
Reviewed By: mrshenli
Differential Revision: D27245421
fbshipit-source-id: 86e7fe208e34eda8a33885e385d56ec6b60eca27