Fix rare hang in DeepSpeed Async I/O wait by releasing the Python GIL (#7727)
_**What this PR does**_
- This PR fixes an occasional deadlock/hang when using DeepSpeed Async I/O (AIO) for NVMe swap-in/swap-out.
- The hang occurs inside `aio_handle.wait()`, where training can stall indefinitely.
_**Reproduction**_
[ds_config.json](https://github.com/user-attachments/files/24179010/ds_config.json)
[finetune_zero3.py](https://github.com/user-attachments/files/24179011/finetune_zero3.py)
Steps
1. Replace `{NVME_PATH}` in `ds_config.json` with a valid NVMe mount path on
your cluster.
2. Build/install DeepSpeed with AIO enabled: `DS_BUILD_AIO=1 pip install
--no-build-isolation .`
3. Run: `CUDA_VISIBLE_DEVICES=0 deepspeed finetune_zero3.py`
_**Fix**_
Release the Python GIL while `aio_handle.wait()` is blocking by adding a
pybind11 call guard (`py::gil_scoped_release`) to the `wait()` binding.
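A minimal sketch of the shape of the change (the class and module names below are stand-ins so the example is self-contained, not copied from the DeepSpeed sources):

```cpp
#include <pybind11/pybind11.h>
namespace py = pybind11;

// Stand-in handle type for illustration; the real class lives in
// DeepSpeed's AIO extension sources.
struct aio_handle_t {
    int wait() { /* block until all queued I/O operations complete */ return 0; }
};

PYBIND11_MODULE(async_io_sketch, m) {
    py::class_<aio_handle_t>(m, "aio_handle")
        .def(py::init<>())
        // The fix: py::gil_scoped_release drops the GIL on entry to wait()
        // and re-acquires it on return, so AIO worker threads can take the
        // GIL for Python-side cleanup while the caller is blocked.
        .def("wait", &aio_handle_t::wait,
             py::call_guard<py::gil_scoped_release>());
}
```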
_**Why this is needed (root cause)**_
Two threads are involved:
- Python main thread: calls `aio_handle.wait()` and blocks until all async
I/O operations complete.
- AIO worker thread(s): perform the actual file I/O in the background.
In some cases, after an I/O operation completes, the worker thread
triggers cleanup of PyTorch tensors (e.g., decref / refcount updates for
Python-backed objects). That cleanup path may require acquiring the
Python GIL.
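As a rough illustration (hypothetical names, not the actual DeepSpeed cleanup code), a worker-side completion callback of this shape must take the GIL before dropping its reference to a Python-backed buffer:

```cpp
#include <pybind11/pybind11.h>
namespace py = pybind11;

// Hypothetical callback run on an AIO worker thread once the file I/O for a
// request has finished; 'buffer' is the Python-backed storage it referenced.
void on_io_complete(PyObject* buffer) {
    py::gil_scoped_acquire gil;  // worker must own the GIL to touch refcounts
    Py_DECREF(buffer);           // refcount update on a Python object
}                                // GIL released when 'gil' goes out of scope
```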
**Before this PR:**
- The Python main thread enters `aio_handle.wait()` while still holding
the GIL.
- `wait()` blocks, waiting for the worker thread(s) to finish.
- A worker thread completes an I/O op and reaches a cleanup path that
attempts to acquire the GIL.
- The worker thread cannot acquire the GIL because it is held by the main
thread blocked in `wait()`.
- Result: the main thread is waiting for the worker, and the worker is
waiting for the GIL → deadlock.
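The same pattern can be reproduced outside DeepSpeed with a few lines of pybind11 embedding code; the sketch below is a standalone illustration of the hold-vs-release behavior, not DeepSpeed code:

```cpp
#include <pybind11/embed.h>
#include <thread>
namespace py = pybind11;

int main() {
    py::scoped_interpreter interp;  // main thread now holds the GIL

    // Deadlock variant (commented out so the program terminates): the worker
    // blocks in gil_scoped_acquire, the main thread blocks in join(), and
    // neither can make progress.
    //
    // std::thread worker([] { py::gil_scoped_acquire gil; });
    // worker.join();

    // Fixed variant: release the GIL around the blocking wait, which is
    // exactly what py::call_guard<py::gil_scoped_release> does for wait().
    {
        py::gil_scoped_release release;
        std::thread worker([] { py::gil_scoped_acquire gil; });
        worker.join();              // completes normally
    }
    return 0;
}
```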