Avoid CUDA context initialization during op compatibility checks at import (#8078)
## Summary
Previously, `import deepspeed` initialized a CUDA context in the parent
process, which permanently broke `fork()`-based multiprocessing (`Cannot
re-initialize CUDA in forked subprocess`). This change makes importing
DeepSpeed fork-safe.
Fixes #7918.
## Root cause
On a GPU box, `import deepspeed` reached **three** distinct calls that
create a CUDA context, each gated differently (which is why a single
patch kept missing one):
1. **`torch.cuda.is_available()`** — called during accelerator
auto-detection (`real_accelerator.py`) and in every CUDA op builder's
`is_compatible()`. By default it runs `cudaGetDeviceCount → cuInit`,
creating a context. Per the [PyTorch
docs](https://docs.pytorch.org/docs/stable/generated/torch.cuda.is_available.html)
this is only avoided with `PYTORCH_NVML_BASED_CUDA_CHECK=1`. Note that
this path does **not** flip `torch.cuda.is_initialized()` to `True`, so an
import-time `assert not is_initialized()` passes even though a context
was created (a false green).
2. **`torch.cuda.get_device_properties(0)`** — in the eight builders'
`is_compatible()` (run at import by `git_version_info.py`); triggers
`torch.cuda._lazy_init()`.
3. **`is_triton_supported()` → `torch.cuda.get_device_capability()`** —
called at module import in `ds_transformer.py`, gated on
`deepspeed.HAS_TRITON`. This only fires when **triton is installed**, so
it was invisible in triton-less environments — but it was the first
initializer on a real GPU node.
## Fix
1. `deepspeed/__init__.py` sets
`os.environ.setdefault("PYTORCH_NVML_BASED_CUDA_CHECK", "1")` as the
very first statement, so `torch.cuda.is_available()` uses the NVML-based
check and never initializes a context. `setdefault()` preserves an
explicit user setting.
2. `CUDAOpBuilder.cuda_capability_major()` (in `op_builder/builder.py`)
reads compute capability only when a context already exists
(`is_initialized()`) and we are not in a forked child
(`_is_in_bad_fork()`, mirroring #7977); otherwise it returns `None`. All
eight builders route through it and skip the capability gate when it
returns `None`, that is, when probing is unsafe.
3. `ds_transformer.py` imports the triton kernels whenever triton is
installed (`if deepspeed.HAS_TRITON:`) instead of also gating on
`is_triton_supported()`. The capability probe is removed from import;
actual triton use stays gated at runtime by `config.use_triton`, where
CUDA is already initialized.
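
The decision tree in item 2 can be sketched roughly as follows. This is a minimal illustration, not the code in `op_builder/builder.py`: the function names `capability_major_if_safe` and `passes_capability_gate` are hypothetical, and the three probes are passed in as callables so the sketch runs without torch. In practice they would presumably be `torch.cuda.is_initialized`, `torch.cuda._is_in_bad_fork`, and `torch.cuda.get_device_capability`.

```python
from typing import Callable, Optional, Tuple


def capability_major_if_safe(
    is_initialized: Callable[[], bool],
    is_in_bad_fork: Callable[[], bool],
    get_capability: Callable[[], Tuple[int, int]],
) -> Optional[int]:
    """Return the compute-capability major version, or None if probing is unsafe.

    Hypothetical helper: it only illustrates the gating order described above.
    """
    # A forked child that inherited CUDA state cannot make CUDA calls.
    if is_in_bad_fork():
        return None
    # No context yet: probing would create one, so skip the probe.
    if not is_initialized():
        return None
    major, _minor = get_capability()
    return major


def passes_capability_gate(major: Optional[int], minimum: int) -> bool:
    """Treat an unknown capability (None) as 'do not block on this gate'."""
    return major is None or major >= minimum
```

The design choice is that an unknown capability skips the gate rather than failing it, so an import-time compatibility check never forces a context into existence.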
## Behavior / tradeoff
- The NVML-based availability check is a slightly weaker assessment than
the default runtime check, and PyTorch falls back to `cudaGetDeviceCount`
if NVML is unavailable (documented PyTorch behavior). This is a non-issue
on standard NVIDIA boxes.
- Dropping the import-time capability gate means triton kernel modules
are imported whenever triton is installed (even on pre-Ampere).
Importing them has no CUDA side effects; their use is still gated by
`config.use_triton`.
## Tests
- Three unit tests for `cuda_capability_major()`'s decision tree
(not-initialized → skip, initialized → probe, bad-fork → skip), run with
`torch.cuda` mocked so no GPU is required.
- `test_forked_child_can_use_cuda_after_importing_deepspeed` — forks
after `import deepspeed`, the child runs a real CUDA op, parent asserts
success.
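
The shape of the fork test can be sketched as below. This is an assumed structure, not the test's actual source: `run_in_forked_child` is a hypothetical helper, and the callable here is a placeholder so the sketch runs on any POSIX machine without a GPU.

```python
import os
from typing import Callable


def run_in_forked_child(fn: Callable[[], None]) -> int:
    """Fork, run fn in the child, and return the child's exit code (0 = success)."""
    pid = os.fork()
    if pid == 0:
        # Child: report the outcome through the exit code only, and leave via
        # os._exit so the parent's atexit handlers and buffers are not replayed.
        code = 1
        try:
            fn()
            code = 0
        except BaseException:
            code = 1
        finally:
            os._exit(code)
    # Parent: wait for the child and decode its status.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
```

On a GPU node, the parent would first `import deepspeed` and then pass a callable that runs something like `torch.ones(1, device="cuda")`; a return value of 0 would indicate the child could initialize CUDA.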
## Validation
Verified on a CUDA GPU node (NVIDIA, torch 2.4.1+cu121). After `import
deepspeed`:
- `torch.cuda.is_initialized()` → `False`
- a forked child runs `torch.ones(1, device="cuda")` successfully (exit
0)
- instrumenting `torch.cuda._lazy_init` shows **0** distinct import-time
CUDA-touch sites (down from the `ds_transformer.py:17` initializer + its
downstream builder probe).
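
One way to do this kind of instrumentation is sketched below. It is an assumption about the approach, not the script used for the numbers above: `record_call_sites` is a hypothetical helper that temporarily wraps an attribute and records the distinct caller locations, and the `keep` predicate is an illustrative way to skip frames (for example, frames inside torch itself).

```python
import contextlib
import traceback
from types import SimpleNamespace


@contextlib.contextmanager
def record_call_sites(owner, name, keep=lambda filename: True):
    """Temporarily wrap owner.<name>; yield a set of distinct (file, line) callers."""
    original = getattr(owner, name)
    sites = set()

    def wrapper(*args, **kwargs):
        # Drop the wrapper's own frame, then walk outward from the nearest caller.
        stack = traceback.extract_stack()[:-1]
        for frame in reversed(stack):
            if keep(frame.filename):
                sites.add((frame.filename, frame.lineno))
                break
        return original(*args, **kwargs)

    setattr(owner, name, wrapper)
    try:
        yield sites
    finally:
        # Always restore the original attribute.
        setattr(owner, name, original)


# Stand-in target so the sketch runs without torch; on a GPU node the owner
# would presumably be torch.cuda and the name "_lazy_init".
fake_cuda = SimpleNamespace(_lazy_init=lambda: "initialized")
with record_call_sites(fake_cuda, "_lazy_init") as touched:
    pass  # e.g. `import deepspeed` would go here
print(len(touched))  # 0 when nothing called the wrapped function
```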
## Docs
Updated `CONTRIBUTING.md` and `docs/contributing.md`: `--forked` is safe
now that `import deepspeed` no longer initializes CUDA.
cc @tjruwase @loadams @tohtana
---------
Signed-off-by: Achyuthan Sivasankar <achyuthan.sivasankar@gmail.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Masahiro Tanaka <mtanaka@anyscale.com>