DeepSpeed
ba29fdca - Fix DecoupledCheckpointEngine deadlock and improve reliability (#7742)

Commit
2 days ago
## Summary

- Add timeouts to prevent indefinite hangs when checkpoint process crashes
- Replace assertion with proper runtime validation
- Add process health checks before blocking operations
- Implement graceful cleanup with escalating termination

Fixes #7741

## Description

The `DecoupledCheckpointEngine` had several critical issues that could cause training jobs to hang indefinitely:

1. **No timeout on `save_event.wait()`**: If the checkpoint process died, training would hang forever waiting for an event that would never fire.
2. **No timeout on `ckpt_process.join()`**: If the process crashed or hung, cleanup would block indefinitely.
3. **Assertion used for runtime validation**: `assert info == self.commit_info` is disabled with `python -O`, allowing silent data corruption in production.
4. **`__del__` could hang**: The destructor called `cleanup()` which could block, causing program hang on exit.

## Changes

- Add `_wait_for_event_with_timeout()` that checks process health every 10 seconds while waiting
- Add `_check_process_alive()` helper to validate process state before blocking operations
- Replace `assert` with proper `if/raise ValueError` for commit info validation
- Add timeout to `join()` with escalating termination: `join()` → `terminate()` → `kill()`
- Wrap `__del__` in try/except and add `_cleanup_called` flag to prevent multiple cleanup calls
- Add health checks in `save()` and `commit()` before queue operations

## Test plan

- [ ] Verify normal checkpoint save/load still works
- [ ] Test behavior when checkpoint process is killed mid-save
- [ ] Verify timeout triggers after 5 minutes of no response
- [ ] Confirm graceful shutdown with Ctrl+C during checkpoint
- [ ] Test with `python -O` to verify validation still works

---------

Signed-off-by: Rakshit-gen <sisodiarakshit456@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
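
For readers who want a concrete picture of the three mechanisms listed under Changes (a timed wait that polls process health, escalating termination, and `if/raise` validation in place of `assert`), here is a minimal standalone sketch. It is not the code from this commit: the function names, signatures, and the 30-second join timeout are assumptions made for illustration. Only the 10-second health-check interval and the 5-minute overall timeout come from the commit message. The helpers accept any objects exposing the `multiprocessing.Event` and `multiprocessing.Process` methods they call.

```python
import time

HEALTH_CHECK_INTERVAL_S = 10.0  # from the commit message: check health every 10 seconds
WAIT_TIMEOUT_S = 300.0          # from the test plan: give up after 5 minutes
JOIN_TIMEOUT_S = 30.0           # assumption: the commit message does not state this value


def wait_for_event_with_timeout(event, process,
                                timeout=WAIT_TIMEOUT_S,
                                poll_interval=HEALTH_CHECK_INTERVAL_S):
    """Block until `event` is set, without hanging if `process` has died.

    Hypothetical helper: waits in short slices and checks `process.is_alive()`
    between slices instead of calling `event.wait()` with no timeout.
    """
    deadline = time.monotonic() + timeout
    while not event.is_set():
        if not process.is_alive():
            # The process may have set the event right before exiting.
            if event.is_set():
                return
            raise RuntimeError("checkpoint process died while waiting for event")
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"no response from checkpoint process after {timeout}s")
        event.wait(timeout=min(poll_interval, remaining))


def shutdown_process(process, join_timeout=JOIN_TIMEOUT_S):
    """Escalating termination: join() -> terminate() -> kill().

    Returns True if the process is no longer alive afterwards.
    """
    process.join(timeout=join_timeout)
    if process.is_alive():
        process.terminate()  # polite request (SIGTERM on POSIX)
        process.join(timeout=join_timeout)
    if process.is_alive():
        process.kill()       # forceful stop (SIGKILL on POSIX)
        process.join(timeout=join_timeout)
    return not process.is_alive()


def validate_commit_info(info, expected):
    """Runtime validation that still runs under `python -O`, unlike `assert`."""
    if info != expected:
        raise ValueError(f"commit info mismatch: got {info!r}, expected {expected!r}")
```

In this sketch a caller would replace a bare `save_event.wait()` with `wait_for_event_with_timeout(save_event, ckpt_process)` and a bare `ckpt_process.join()` with `shutdown_process(ckpt_process)`. Both helpers bound how long the training process can block, which is the property the commit is after.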