pytorch
c18e8c68 - [ROCm] fix parallel test runners and device visibility (#91137)

Commit

2 years ago

[ROCm] fix parallel test runners and device visibility (#91137)

Fixes #90940. This PR revamps how tests are run in parallel as well as device visibility at the docker container and within the run_test.py test runner.

First, running multiple test modules concurrently on the same GPU was causing instability for ROCm runners manifesting as timeouts. ROCm runners have at least 1 GPU each, but often 2 or more. This PR allows NUM_PROCS to be set equal to the number of devices available, but also takes care to set HIP_VISIBLE_DEVICES to avoid oversubscribing any GPU.

Second, we had introduced env vars `-e ROCR_VISIBLE_DEVICES` (#91031) to prepare for two GHA runners per CI node, to split up the GPU visibility at the docker level between the two runners. This effort wasn't fully realized; to date, we haven't had more than one runner per CI host. We abandon this effort in favor of all GPUs being visible to a single runner and managing GPU resources as stated above.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91137
Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/pruthvistony
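The idea of running one worker per GPU without oversubscription can be illustrated with a minimal sketch. This is not the code from the PR: the helper name `worker_envs` and its parameters are hypothetical, and the only assumption taken from the commit message is that each worker process gets its own `HIP_VISIBLE_DEVICES` value and that the number of workers equals the number of devices.

```python
import os
from typing import Dict, List, Optional


def worker_envs(
    num_devices: int, base_env: Optional[Dict[str, str]] = None
) -> List[Dict[str, str]]:
    """Build one environment per worker, each pinned to a single GPU.

    Hypothetical helper for illustration only; it is not part of run_test.py.
    """
    if num_devices < 1:
        raise ValueError("at least one device is required")

    # Copy the base environment so the caller's mapping is never mutated.
    base = dict(os.environ if base_env is None else base_env)

    envs = []
    for device_index in range(num_devices):
        env = dict(base)
        # Each worker sees exactly one device, so no GPU is shared.
        env["HIP_VISIBLE_DEVICES"] = str(device_index)
        envs.append(env)
    return envs
```

Each returned mapping could then be passed as the `env` argument of `subprocess.run` when launching a test module, so that the number of concurrent workers matches the number of devices and no two workers target the same GPU.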

Author

jeffdaily

Committer

pytorchmergebot

Parents

5a601903
