DeepSpeed
b01a0915 - fix(io): close aio_fd in FastFileWriter._fini to prevent fd leak (#8005)

Commit
8 days ago
fix(io): close aio_fd in FastFileWriter._fini to prevent fd leak (#8005) Fixes #8003 ## Summary `FastFileWriter._fini()` overwrote `self._aio_fd = INVALID_FD` without calling `os.close()`, leaking one fd per save. With unlink-based checkpoint rotation this stranded the unlinked inode in the ext4 orphan list, fs blocks were never reclaimed, and long-running save loops hit ENOSPC at iter ~60 (60 GB/iter on a 4 TB partition). This PR adds explicit `os.fsync()` + `os.close()` in `_fini()` and a regression test that asserts no `/proc/self/fd` entry points at a deleted file after a save+close+unlink cycle. ## Verification - 20-iteration repro of `save() / close() / unlink()` leaked 20 fds before the fix, 0 after. - 700-iter / 42 TB / 60 h endurance run on ext4/NVMe: `df_used` stable at 736 GB (drift +281 MB / 697 rotations) with the fix; same workload hit ENOSPC at iter ~60 without it. - Performance impact: ~5% wall-time overhead from the added `os.fsync()` at ~10 GB/s peak. ## Test plan - [x] New regression test `tests/unit/ops/aio/test_fast_file_writer_fd_close.py` verifies fd cleanup after a single save and after 5/20-iter rotation loops via `/proc/self/fd` scoped to `tmp_path`. - [x] Gated on `async_io` compatibility, Linux, and CUDA accelerator so unsupported CI matrix entries skip cleanly. - [x] Confirmed test FAILS without this PR's `_fini()` change and PASSES with it. - [x] `pre-commit run --files <changed files>` clean. ## Notes - The `__del__` assertion `assert self._aio_fd == INVALID_FD` passes even with the bug because it checks the Python attribute that `_fini` itself sets. The new test checks OS-level state via `/proc/self/fd`. - `os.fsync()` is included for post-close durability — required for correctness on the unaligned-tail path that re-opens the file as buffered I/O. If maintainers prefer to drop it for performance, removing only the `os.fsync(...)` line still fixes the leak. Happy to adjust shape, naming, or test placement to fit project conventions. Thanks for the review. Signed-off-by: jg-heo <csjg.heo@gmail.com> Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Author
Parents
Loading