Fix OSS flaky test_destroy_full_group on MPI backend in pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test environment by adding a barrier and retrying MPI_Comm_create 3 times (#55921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55921
Fix this flaky test by adding a barrier and retrying the flaky `MPI_Comm_create` call up to 3 times.
I couldn't figure out the root cause of why `createProcessGroupMPI` can be flaky when it merely creates a subgroup communicator, which mainly involves invoking `MPI_Comm_create`; `createProcessGroupMPI` does not perform any p2p or collective communication at all. I also could not dig further into `MPI_Comm_create`, since it lives in the MPI codebase.
I also checked the commit history, and there were no commits to `ProcessGroupMPI.cpp` in the few days before Mar 10th.
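For illustration, a minimal sketch of the retry-with-barrier pattern (hypothetical function and constant names, not the exact patch in `ProcessGroupMPI.cpp`; it assumes the communicator's error handler is set to `MPI_ERRORS_RETURN` so a failed call returns an error code instead of aborting):
```cpp
#include <mpi.h>
#include <stdexcept>

// Sketch: create a subgroup communicator, retrying MPI_Comm_create a few
// times with a barrier before each attempt so all ranks are synchronized.
MPI_Comm createSubgroupComm(MPI_Comm parentComm, MPI_Group group) {
  constexpr int kMaxRetries = 3;
  MPI_Comm newComm = MPI_COMM_NULL;
  for (int attempt = 0; attempt < kMaxRetries; ++attempt) {
    // Synchronize all ranks in the parent communicator before (re)trying.
    MPI_Barrier(parentComm);
    if (MPI_Comm_create(parentComm, group, &newComm) == MPI_SUCCESS) {
      return newComm;
    }
  }
  throw std::runtime_error("MPI_Comm_create failed after retries");
}
```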
First failure (on Mar 10th):
https://app.circleci.com/pipelines/github/pytorch/pytorch/283704/workflows/d84ac4a0-42e3-4925-b1cf-32d3c3d1022a/jobs/11456129
Note that the test failure cannot be reproduced locally.
Verified the fix on CI:
https://app.circleci.com/pipelines/github/pytorch/pytorch/300586/workflows/a5c16db4-3ae2-44c7-a9c8-b0885dad2a64/jobs/12356852
test_destroy_full_group was rerun 100 times and passed.
#Closes: https://github.com/pytorch/pytorch/issues/53899
ghstack-source-id: 126414937
Test Plan:
```
export BACKEND=mpi
export WORLD_SIZE=2
pytest -k test_destroy_full_group test/distributed/test_distributed_fork.py -vs
```
```
#!/bin/bash
for i in {1..100}; do
  pytest -k test_destroy_full_group test/distributed/test_distributed_fork.py
done
```
CI tests triggered from a new branch:
https://app.circleci.com/pipelines/github/pytorch/pytorch?branch=ci-all%2Fwayi_mpi
Reviewed By: mrshenli
Differential Revision: D27245421
fbshipit-source-id: 86e7fe208e34eda8a33885e385d56ec6b60eca27