c248f2f3 - [ROCm] Modify GPUs visibility code when starting docker container (#91031)

[ROCm] Modify GPUs visibility code when starting docker container (#91031)

Use ROCR_VISIBLE_DEVICES to limit GPU visibility, in preparation for the CI node upgrade to ROCm 5.3 KFD and UB22.04.

### PROBLEM
After upgrading some of our CI nodes to UB22.04 and ROCm 5.3 KFD, rocminfo doesn't work inside the docker container if we use the following flags: `--device=/dev/dri/renderD128 --device=/dev/dri/renderD129`. It gives the error:
```
+ rocminfo
ROCk module is loaded
Failed to set mem policy for GPU [0x6b0d]
hsa api call failure at: /long_pathname_so_that_rpms_can_package_the_debug_info/src/rocminfo/rocminfo.cc:1140
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
```

### WORKAROUND
Use `--device=/dev/dri` instead, and use `ROCR_VISIBLE_DEVICES` to limit GPU visibility inside the container.

### BACKGROUND OF ORIGINAL CODE
We introduced these flags to prepare for 2 runners per CI node, to split up the GPU visibility among the runners: https://github.com/pytorch/pytorch/blame/master/.github/actions/setup-rocm/action.yml#L58

That effort - 2 runners per CI node - is still pending, and we might need to revisit this patch when we try to enable that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91031
Approved by: https://github.com/jeffdaily, https://github.com/malfet
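For illustration only, here is a minimal sketch of what a `docker run` invocation might look like under this workaround. It is not the literal contents of the setup-rocm action; the image name, GPU indices, and extra flags (e.g. `--group-add video`) are placeholders chosen for the example.

```bash
# Sketch (assumptions, not the actual CI script): expose the whole /dev/dri
# directory instead of individual renderD* nodes, then restrict which GPUs
# the ROCm runtime sees via ROCR_VISIBLE_DEVICES.
docker run --rm \
  --device=/dev/dri \
  --group-add video \
  -e ROCR_VISIBLE_DEVICES=0,1 \
  rocm/dev-ubuntu-22.04 \
  rocminfo
```

The key design point is that device filtering moves from the docker `--device` flags (which broke rocminfo on ROCm 5.3 KFD + UB22.04) into the ROCm runtime itself via the `ROCR_VISIBLE_DEVICES` environment variable, so each runner can still be limited to a subset of GPUs once the 2-runners-per-node setup is revisited.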