distributed_test: Map rank to GPU accordingly (#47898)
Summary:
If world_size is less than or equal to the number of GPUs available,
then each rank can be mapped directly to the corresponding GPU.
This fixes the issue referenced in https://github.com/pytorch/pytorch/issues/45435 and https://github.com/pytorch/pytorch/issues/47629
For world_size = 3 and 8 GPUs, the old rank-to-GPU mapping assigns
GPUs 0, 2, 4. Since the introduction of an initial barrier
(refer PR https://github.com/pytorch/pytorch/issues/45181),
the tensors in the barrier are mapped to cuda:0, cuda:1, cuda:2, while the
tensors in the actual test cases are mapped to cuda:0, cuda:2, cuda:4. The
mismatch results in different streams and leads to a timeout. This issue is
specific to the default process group; it is not observed in a new process
group, since the streams are created again after the initial barrier call.
This patch maps each rank to the corresponding GPU when world_size is
less than or equal to the number of GPUs, i.e. GPUs 0, 1, 2 in this example.
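As a minimal sketch of the before/after mapping (the helper names and the
exact arithmetic are assumptions inferred from this summary, not the
verbatim patch):

```python
def old_first_gpu(rank: int, world_size: int, n_gpus: int) -> int:
    # Previous scheme: each rank owned a contiguous block of
    # n_gpus // world_size devices. With world_size = 3 and n_gpus = 8,
    # the blocks start at GPUs 0, 2, 4, while the barrier tensors sit on
    # cuda:0, cuda:1, cuda:2.
    return rank * (n_gpus // world_size)

def new_first_gpu(rank: int, world_size: int, n_gpus: int) -> int:
    # Fixed scheme: when world_size <= n_gpus, map rank i straight to
    # GPU i, so test tensors share devices (and streams) with the
    # barrier tensors.
    if world_size <= n_gpus:
        return rank
    return old_first_gpu(rank, world_size, n_gpus)

assert [old_first_gpu(r, 3, 8) for r in range(3)] == [0, 2, 4]
assert [new_first_gpu(r, 3, 8) for r in range(3)] == [0, 1, 2]
```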
Note: The barrier function in distributed_c10d.py should accept a new
parameter to specify the tensor or rank-to-GPU mapping. In that case, this
patch would be redundant but harmless, since the tests could specify tensors
on the appropriate GPUs.
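For illustration, a call under such a parameter might look like the sketch
below (the `device_ids` name is an assumption; no such argument exists at
the time of this patch):

```python
import torch.distributed as dist

# Hypothetical: pin the barrier's tensor to this rank's GPU so it shares
# a stream with the tensors used by the test itself.
rank = dist.get_rank()
dist.barrier(device_ids=[rank])
```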
Fixes https://github.com/pytorch/pytorch/issues/47629
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47898
Reviewed By: smessmer
Differential Revision: D24956021
Pulled By: rohan-varma
fbshipit-source-id: a88257f22a7991ba36566329766c106d3360bb4e