[ROCm] fix parallel test runners and device visibility (#91137)
Fixes #90940. This PR revamps how tests are run in parallel as well as device visibility at the docker container and within the run_test.py test runner.
First, running multiple test modules concurrently on the same GPU was causing instability for ROCm runners manifesting as timeouts. ROCm runners have at least 1 GPU each, but often 2 or more. This PR allows NUM_PROCS to be set equal to the number of devices available, but also takes care to set HIP_VISIBLE_DEVICES to avoid oversubscribing any GPU.
Second, we had introduced env vars `-e ROCR_VISIBLE_DEVICES` (#91031) to prepare for two GHA runners per CI node, to split up the GPU visibility at the docker level between the two runners. This effort wasn't fully realized; to date, we haven't had more than one runner per CI host. We abandon this effort in favor of all GPUs being visible to a single runner and managing GPU resources as stated above.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91137
Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/pruthvistony