DeepSpeed
2bc16e21 - Fix rare hang in DeepSpeed Async I/O wait by releasing the Python GIL (#7727)

_**What this PR does**_

- Fixes an occasional deadlock/hang when using DeepSpeed Async I/O (AIO) for NVMe swap-in/swap-out.
- The hang occurs inside `aio_handle.wait()`, where training can stall indefinitely.

_**Reproduction**_

[ds_config.json](https://github.com/user-attachments/files/24179010/ds_config.json)
[finetune_zero3.py](https://github.com/user-attachments/files/24179011/finetune_zero3.py)

Steps:
1. Replace `{NVME_PATH}` in ds_config.json with a valid NVMe mount path on your cluster.
2. Build/install DeepSpeed with AIO enabled: `DS_BUILD_AIO=1 pip install --no-build-isolation .`
3. Run: `CUDA_VISIBLE_DEVICES=0 deepspeed finetune_zero3.py`

_**Fix**_

Release the Python GIL while `aio_handle.wait()` is blocking by adding a pybind11 call guard (`py::gil_scoped_release`) to the `wait()` binding (see the binding sketch at the end of this description).

_**Why this is needed (root cause)**_

Two threads are involved:

- Python main thread: calls `aio_handle.wait()` and blocks until all async I/O operations complete.
- AIO worker thread(s): perform the actual file I/O in the background.

In some cases, after an I/O operation completes, the worker thread triggers cleanup of PyTorch tensors (e.g., decref/refcount updates for Python-backed objects). That cleanup path may require acquiring the Python GIL.

**Before this PR:**

- The Python main thread enters `aio_handle.wait()` while still holding the GIL.
- `wait()` blocks, waiting for the worker thread(s) to finish.
- A worker thread completes an I/O op and reaches a cleanup path that attempts to acquire the GIL.
- The worker thread cannot acquire the GIL because it is held by the Python thread blocked in `wait()`.
- Result: the Python thread is waiting for the worker, and the worker is waiting for the GIL → deadlock (illustrated in the sketches below).
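To make the interaction concrete, here is a minimal, hypothetical sketch of the worker-side code path. The function name `worker_cleanup` is illustrative, not DeepSpeed's actual internals; the point is that the worker's `gil_scoped_acquire` can never succeed while the main thread blocks in `wait()` with the GIL held.

```cpp
#include <pybind11/pybind11.h>
namespace py = pybind11;

// Hypothetical worker-thread completion path. After an I/O op finishes,
// the worker drops its reference to a Python-backed buffer, which must
// happen under the GIL.
void worker_cleanup(PyObject* buffer) {
    py::gil_scoped_acquire gil;  // blocks forever if the Python main thread
                                 // holds the GIL while waiting on this worker
    Py_DECREF(buffer);           // refcount update requires the GIL
}
```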
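And a sketch of the binding-level fix, assuming the handle is exposed through pybind11 roughly as below (the `aio_handle` class, its `wait()` body, and the module name are illustrative placeholders, not DeepSpeed's real definitions):

```cpp
#include <pybind11/pybind11.h>
namespace py = pybind11;

struct aio_handle {
    // Blocks until all submitted async I/O operations complete.
    int wait() { /* join/poll worker threads */ return 0; }
};

PYBIND11_MODULE(async_io, m) {
    py::class_<aio_handle>(m, "aio_handle")
        .def(py::init<>())
        // Before: .def("wait", &aio_handle::wait) ran with the GIL held,
        // so a worker needing the GIL for tensor cleanup could deadlock.
        // After: the call guard releases the GIL for the duration of
        // wait(), letting workers acquire it while Python blocks here.
        .def("wait", &aio_handle::wait,
             py::call_guard<py::gil_scoped_release>());
}
```

With the call guard in place, the blocked `wait()` no longer owns the GIL, so the worker's `gil_scoped_acquire` succeeds, cleanup completes, and `wait()` returns normally.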