DeepSpeed
5aef6d88 - Avoid CUDA context initialization during op compatibility checks at import (#8078)

Commit

4 days ago

Avoid CUDA context initialization during op compatibility checks at import (#8078)

## Summary

`import deepspeed` initialized a CUDA context in the parent process, which permanently breaks `fork()`-based multiprocessing (`Cannot re-initialize CUDA in forked subprocess`). This makes importing DeepSpeed fork-safe.

Fixes #7918.

## Root cause

On a GPU box, `import deepspeed` reached **three** distinct calls that create a CUDA context, each gated differently (which is why a single patch kept missing one):

1. **`torch.cuda.is_available()`**: called during accelerator auto-detection (`real_accelerator.py`) and in every CUDA op builder's `is_compatible()`. By default it runs `cudaGetDeviceCount → cuInit`, creating a context. Per the [PyTorch docs](https://docs.pytorch.org/docs/stable/generated/torch.cuda.is_available.html) this is only avoided with `PYTORCH_NVML_BASED_CUDA_CHECK=1`. Note it does **not** set `torch.cuda.is_initialized()`, so an import-time `assert not is_initialized()` is a false-green.
2. **`torch.cuda.get_device_properties(0)`**: in the eight builders' `is_compatible()` (run at import by `git_version_info.py`); triggers `torch.cuda._lazy_init()`.
3. **`is_triton_supported()` → `torch.cuda.get_device_capability()`**: called at module import in `ds_transformer.py`, gated on `deepspeed.HAS_TRITON`. This only fires when **triton is installed**, so it was invisible in triton-less environments, but it was the first initializer on a real GPU node.

## Fix

1. `deepspeed/__init__.py` sets `os.environ.setdefault("PYTORCH_NVML_BASED_CUDA_CHECK", "1")` as the very first statement, so `torch.cuda.is_available()` uses the NVML-based check and never initializes a context. `setdefault()` preserves an explicit user setting.
2. `CUDAOpBuilder.cuda_capability_major()` (in `op_builder/builder.py`) reads compute capability only when a context already exists (`is_initialized()`) and we are not in a forked child (`_is_in_bad_fork()`, mirroring #7977); otherwise returns `None`. All eight builders route through it and skip the capability gate when probing is unsafe.
3. `ds_transformer.py` imports the triton kernels whenever triton is installed (`if deepspeed.HAS_TRITON:`) instead of also gating on `is_triton_supported()`. The capability probe is removed from import; actual triton use stays gated at runtime by `config.use_triton`, where CUDA is already initialized.

## Behavior / tradeoff

- NVML-based availability is a slightly weaker assessment than the default runtime check and falls back to `cudaGetDeviceCount` if NVML is unavailable (documented PyTorch behavior); a non-issue on standard NVIDIA boxes.
- Dropping the import-time capability gate means triton kernel modules are imported whenever triton is installed (even on pre-Ampere). Importing them has no CUDA side effects; their use is still gated by `config.use_triton`.

## Tests

- Three unit tests for `cuda_capability_major()`'s decision tree (not-initialized → skip, initialized → probe, bad-fork → skip), mocked `torch.cuda`, no GPU required.
- `test_forked_child_can_use_cuda_after_importing_deepspeed`: forks after `import deepspeed`, the child runs a real CUDA op, parent asserts success.

## Validation

Verified on a CUDA GPU node (NVIDIA, torch 2.4.1+cu121). After `import deepspeed`:

- `torch.cuda.is_initialized()` → `False`
- a forked child runs `torch.ones(1, device="cuda")` successfully (exit 0)
- instrumenting `torch.cuda._lazy_init` shows **0** distinct import-time CUDA-touch sites (down from the `ds_transformer.py:17` initializer + its downstream builder probe).

## Docs

Updated `CONTRIBUTING.md` and `docs/contributing.md`: `--forked` is safe now that `import deepspeed` no longer initializes CUDA.
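
## Illustrative sketch

The decision tree described for `cuda_capability_major()` (Fix item 2) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in `op_builder/builder.py`: the names `capability_major_if_safe`, `passes_capability_gate`, and `fake_cuda` are hypothetical, the `cuda` argument stands in for `torch.cuda`, the use of `get_device_capability()` for the probe is an assumption, and the bad-fork state is passed in as a boolean so the sketch runs without torch or a GPU.

```python
from types import SimpleNamespace
from typing import Any, Optional


def capability_major_if_safe(cuda: Any, in_bad_fork: bool) -> Optional[int]:
    """Return the major compute capability, or None when probing is unsafe.

    `cuda` is any object exposing is_initialized() and get_device_capability()
    (with real torch this would be `torch.cuda`; that wiring is an assumption).
    """
    if in_bad_fork:
        # Forked child of a CUDA-initialized parent: touching CUDA would fail.
        return None
    if not cuda.is_initialized():
        # No context yet: probing here would create one, so skip.
        return None
    major, _minor = cuda.get_device_capability()
    return major


def passes_capability_gate(major: Optional[int], min_major: int) -> bool:
    """Hypothetical caller: skip the gate (treat as compatible) when major is None."""
    return True if major is None else major >= min_major


def fake_cuda(initialized: bool, capability=(8, 0)) -> SimpleNamespace:
    """Stand-in for torch.cuda so the three branches can be exercised on CPU."""
    return SimpleNamespace(
        is_initialized=lambda: initialized,
        get_device_capability=lambda: capability,
    )


if __name__ == "__main__":
    print(capability_major_if_safe(fake_cuda(False), in_bad_fork=False))  # None
    print(capability_major_if_safe(fake_cuda(True), in_bad_fork=False))   # 8
    print(capability_major_if_safe(fake_cuda(True), in_bad_fork=True))    # None
```

The three calls at the bottom correspond to the three unit-test branches listed under Tests (not-initialized → skip, initialized → probe, bad-fork → skip). Returning `None` rather than raising lets each builder treat "unknown capability" as "do not block at import".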

cc @tjruwase @loadams @tohtana

---------

Signed-off-by: Achyuthan Sivasankar <achyuthan.sivasankar@gmail.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Masahiro Tanaka <mtanaka@anyscale.com>

References

#8078 - Avoid CUDA context initialization during op compatibility checks at import

Author

Achyuthan-S

Parents

10f76c2e
