Stop runner service when its GPU crashes (#97585)
Per title, I'm looking for a way to take the runner out of service when its GPU crashes and cannot recover. Taking the faulty runner out of service prevents future jobs from being assigned to it, as they would surely fail.
This is based on the observation that GPU crashes usually happen in the middle of a test or in the next `setup-nvidia` step. This only happens on G5 runners with A10G GPUs, so the suspicion is that this is a hardware failure. Updating to the newer NVIDIA driver (525.85.06) might or might not help with the issue (https://github.com/pytorch/pytorch/pull/96904), so I'm preparing this PR as a preemptive measure. Here are the symptoms when the GPU crashes:
* Tests fail with a "No CUDA GPUs are available" error when initializing CUDA. For example:
  * https://github.com/pytorch/pytorch/actions/runs/4506110581/jobs/7932832519
  * https://github.com/pytorch/pytorch/actions/runs/4507220502/jobs/7935084759
* Calling `nvidia-smi` times out after 60 seconds. For example:
  * https://github.com/pytorch/pytorch/actions/runs/4496201282/jobs/7910938448
* Running `nvidia-smi` fails with an "unable to determine the device handle for GPU: unknown error" message. For example:
  * https://github.com/pytorch/pytorch/actions/runs/4546343549/jobs/8015359600
* Running `docker --gpus all` fails with an error response from the daemon, while the command `nvidia-container-cli` fails with `detection error: nvml error: unknown error`. For example:
  * https://github.com/pytorch/pytorch/actions/runs/4545579871/jobs/8013667872
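
The symptoms above could be checked with something like the following. This is a minimal sketch, not the code in this PR: the function names `looks_like_gpu_crash` and `gpu_has_crashed` are hypothetical, the matched strings are taken from the error messages listed above, and the only external behavior relied on is that GNU `timeout` exits with status 124 when the time limit is hit.

```bash
# Hypothetical helper: returns 0 when the given text contains one of the
# GPU-crash signatures listed above, 1 otherwise.
looks_like_gpu_crash() {
  local output="$1"
  local pattern
  for pattern in \
    "no cuda gpus are available" \
    "unable to determine the device handle for gpu" \
    "nvml error: unknown error"; do
    # -i: case-insensitive, -F: fixed string, -q: quiet (status only)
    if printf '%s\n' "$output" | grep -qiF -- "$pattern"; then
      return 0
    fi
  done
  return 1
}

# Hypothetical probe: run nvidia-smi under a 60-second limit and report
# (via exit status 0) whether the GPU looks crashed.
gpu_has_crashed() {
  local output
  local status
  output=$(timeout 60 nvidia-smi 2>&1)
  status=$?
  # GNU timeout exits with 124 when the command exceeds the time limit.
  if [ "$status" -eq 124 ]; then
    return 0
  fi
  # A non-zero exit only counts as a crash if the output matches a signature.
  if [ "$status" -ne 0 ] && looks_like_gpu_crash "$output"; then
    return 0
  fi
  return 1
}
```

Keeping the string matching separate from the `nvidia-smi` call means the same check can be applied to test logs or to `nvidia-container-cli` output.
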
I assume that an offline runner with a stopped runner service will be torn down and recycled properly by the infra scaling process.
### Testing
https://github.com/pytorch/pytorch/actions/runs/4517112069/jobs/7956204805. When it runs, the code fetches the service name from the `${{ RUNNER_WORKSPACE }}/../../.service` file and issues `sudo systemctl stop ${RUNNER_SERVICE_NAME}` to stop the self-hosted runner service.
The job will show its status as `The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.`
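
For illustration, the stop step described above might look roughly like this. It is a sketch under assumptions rather than the exact code in this PR: the function names and the `DRY_RUN` switch are hypothetical, and it assumes the `.service` file contains only the service name.

```bash
# Hypothetical helper: print the runner service name stored in the
# .service file two directories above the runner workspace.
read_runner_service_name() {
  local workspace="$1"
  local service_file="${workspace}/../../.service"
  if [ ! -f "$service_file" ]; then
    echo "service file not found: ${service_file}" >&2
    return 1
  fi
  # Assumption: the file holds just the service name; drop all whitespace.
  tr -d '[:space:]' < "$service_file"
}

# Hypothetical helper: stop the self-hosted runner service.
# With DRY_RUN=1 (an assumption of this sketch) the command is only printed.
stop_runner_service() {
  local workspace="$1"
  local service_name
  service_name=$(read_runner_service_name "$workspace") || return 1
  if [ -z "$service_name" ]; then
    echo "runner service name is empty" >&2
    return 1
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "sudo systemctl stop ${service_name}"
  else
    sudo systemctl stop "${service_name}"
  fi
}
```

In a workflow step this would be called as `stop_runner_service "${RUNNER_WORKSPACE}"` once a GPU crash has been detected; setting `DRY_RUN=1` first makes it possible to check which service would be stopped without touching the runner.
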
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97585
Approved by: https://github.com/jeanschmidt