pytorch
4563adac - Update the use of nvidia-smi for GPU healthcheck (#98036)

Commit
1 year ago
This goes together with https://github.com/pytorch/test-infra/pull/3967 to:

* Provide a more accurate health check command with `nvidia-smi`
* Avoid running the check in the edge case where `nvidia-smi` doesn't even exist due to a GitHub outage, as in https://github.com/pytorch/pytorch/actions/runs/4591098682/jobs/8107204277
* Also check the number of GPUs as part of the health check. The number of GPUs needs to be a power of 2 on a healthy runner. Fixes https://github.com/pytorch/test-infra/issues/4000

### Testing

Luckily, the PR picked up the broken runner https://github.com/pytorch/pytorch/actions/runs/4640688249/jobs/8213191715, and the script correctly detected that the runner had only 3 of 4 GPUs and shut it down.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98036
Approved by: https://github.com/weiwangmeta
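
The checks described above can be illustrated with a minimal sketch. This is not the script from the PR: the function names (`is_power_of_two`, `count_gpus`, `check_gpu_health`) and the returned status strings are hypothetical, and the sketch assumes `nvidia-smi -L` prints one line per device beginning with `GPU `.

```python
import shutil
import subprocess


def is_power_of_two(n: int) -> bool:
    """Return True for 1, 2, 4, 8, ...; False for 0, negatives, and e.g. 3."""
    return n > 0 and (n & (n - 1)) == 0


def count_gpus(listing: str) -> int:
    """Count devices in `nvidia-smi -L` style output (one 'GPU N: ...' line each)."""
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def check_gpu_health() -> str:
    """Hypothetical health check returning 'skipped', 'healthy', or 'unhealthy'."""
    # Edge case from the commit message: nvidia-smi may not exist at all,
    # in which case the check is not run.
    if shutil.which("nvidia-smi") is None:
        return "skipped"

    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return "unhealthy"

    # A healthy runner is expected to expose a power-of-2 number of GPUs.
    return "healthy" if is_power_of_two(count_gpus(result.stdout)) else "unhealthy"
```

Under these assumptions, a runner that reports only 3 GPUs fails the power-of-2 test and is classified as unhealthy, which matches the behavior described in the Testing section.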