[ROCm] Split ROCm pytest into single- and multi-accelerator passes

- Use the existing `multiaccelerator` pytest marker to split ROCm CI
tests into two passes, matching the TPU pytest workflow.
- The single-accelerator pass runs with xdist parallelism and
`-m "not multiaccelerator"`.
- The multi-accelerator pass runs without xdist across all GPUs with
`-m "multiaccelerator"` and is skipped on single-GPU machines.
- Override `ROCR_VISIBLE_DEVICES` in the `conftest.py` xdist hook so that
each worker is always pinned to the correct GPU.
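
For illustration only, the worker pinning described in the last bullet could look roughly like the sketch below. It is not the code from this change. It assumes a round-robin mapping from xdist worker id to GPU index, and `gpu_index_for_worker`, `pin_worker`, and the `NUM_GPUS` variable are hypothetical names. The only pytest-xdist behavior relied on is that workers see `PYTEST_XDIST_WORKER` set to an id such as `gw0`.

```python
import os


def gpu_index_for_worker(worker_id: str, num_gpus: int) -> int:
    """Map an xdist worker id such as "gw3" to a GPU index (round-robin)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if not worker_id.startswith("gw"):
        raise ValueError(f"unexpected xdist worker id: {worker_id!r}")
    return int(worker_id[len("gw"):]) % num_gpus


def pin_worker(env, num_gpus: int) -> None:
    """Overwrite ROCR_VISIBLE_DEVICES for an xdist worker; no-op otherwise."""
    worker_id = env.get("PYTEST_XDIST_WORKER")  # set by pytest-xdist in workers
    if worker_id is None:
        return  # controller process, or a run without xdist
    # Assign unconditionally: a value inherited from the parent environment
    # is replaced rather than respected.
    env["ROCR_VISIBLE_DEVICES"] = str(gpu_index_for_worker(worker_id, num_gpus))


def pytest_configure(config):
    # NUM_GPUS is a hypothetical variable used only for this sketch; a real
    # conftest.py would detect the GPU count in its own way.
    num_gpus = int(os.environ.get("NUM_GPUS", "1"))
    pin_worker(os.environ, num_gpus)
```

Overwriting the variable instead of only setting a default is the point of the sketch: a value already exported in the CI environment would otherwise be inherited by every worker. The assignment has to happen before the ROCm runtime initializes in the worker process for it to take effect.
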
(cherry picked from commit 663efe75aa4720bf95602209197b6548521a48c5)
(merge commit found here: 7cb36e57959f5a0a7e9d0db69918ea319e7d0e8f)