Avoid poisoning process with CUDA calls as soon as importing (#6810)
Call `torch.cuda.device_count() > 0` before `torch.cuda.is_available()`,
to give priority to nvml based availability, so that we can try not to
poison process with CUDA calls as soon as we execute `import deepspeed`.
https://github.com/pytorch/pytorch/blob/v2.5.1/torch/cuda/__init__.py#L120-L124
There are 2 reasons to make this change:
Firstly, if we accidentally import deepspeed, since the CUDA runtime
initializes when the first CUDA API call is made and caches the device
list, changing the CUDA_VISIBLE_DEVICES within the same process after
initialization won't have any effect on the visible devices. The
specific case:
https://github.com/OpenRLHF/OpenRLHF/pull/524#issuecomment-2501505023
A demo for reproduction before the fix is applied:
```python
import torch
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""
import deepspeed
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
torch.cuda.set_device('cuda:0')
```
Secondly, https://pytorch.org/docs/stable/notes/cuda.html
When assessing the availability of CUDA in a given environment
(is_available()), PyTorch’s default behavior is to call the CUDA Runtime
API method cudaGetDeviceCount. Because this call in turn initializes the
CUDA Driver API (via cuInit) if it is not already initialized,
subsequent forks of a process that has run is_available() will fail with
a CUDA initialization error.
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>