Fix DecoupledCheckpointEngine deadlock and improve reliability (#7742)
## Summary
- Add timeouts to prevent indefinite hangs when the checkpoint process
crashes
- Replace assertion with proper runtime validation
- Add process health checks before blocking operations
- Implement graceful cleanup with escalating termination
Fixes #7741
## Description
The `DecoupledCheckpointEngine` had several critical issues that could
cause training jobs to hang indefinitely:
1. **No timeout on `save_event.wait()`**: If the checkpoint process
died, training would hang forever waiting for an event that would never
fire.
2. **No timeout on `ckpt_process.join()`**: If the process crashed or
hung, cleanup would block indefinitely.
3. **Assertion used for runtime validation**: `assert info ==
self.commit_info` is stripped under `python -O`, so a mismatch goes
unchecked and can silently corrupt data in production.
4. **`__del__` could hang**: The destructor called `cleanup()`, which
could block and cause the program to hang on exit.
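
The assertion issue (item 3) can be illustrated with a minimal,
standalone sketch. It is not the engine's code: `validate_commit_info`
and `assert_survives_optimized_mode` are hypothetical helper names used
only here. The first shows an explicit check that still runs under
`python -O`; the second runs a child interpreter with `-O` to show that
an `assert` is skipped there.

```python
import subprocess
import sys


def validate_commit_info(info, expected):
    """Explicit runtime check; unlike `assert`, it is not removed by -O."""
    if info != expected:
        raise ValueError(
            f"commit info mismatch: got {info!r}, expected {expected!r}"
        )


def assert_survives_optimized_mode():
    """Run `assert False` in a child interpreter started with -O.

    Returns True if the assert still fired (non-zero exit code), False if
    the interpreter skipped it and reached the print statement.
    """
    result = subprocess.run(
        [sys.executable, "-O", "-c", "assert False; print('reached')"],
        capture_output=True,
        text=True,
    )
    return result.returncode != 0
```

Under `-O` the child prints `reached` and exits cleanly, so the second
helper returns `False`, while `validate_commit_info` raises in either
mode.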
## Changes
- Add `_wait_for_event_with_timeout()` that checks process health every
10 seconds while waiting
- Add `_check_process_alive()` helper to validate process state before
blocking operations
- Replace the `assert` with an explicit `if` / `raise ValueError` check
for commit info validation
- Add timeout to `join()` with escalating termination: `join()` →
`terminate()` → `kill()`
- Wrap `__del__` in try/except and add `_cleanup_called` flag to prevent
multiple cleanup calls
- Add health checks in `save()` and `commit()` before queue operations
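
The waiting and shutdown changes above could look roughly like the
sketch below. It is an illustration under assumptions, not the code in
this PR: the functions are standalone rather than methods, the return
values of `shutdown_process` are invented for readability, and the
defaults (300 s overall timeout, 10 s poll interval, 5 s per join) are
taken from or guessed around the numbers in this description. Any object
with `is_alive()`, `join(timeout)`, `terminate()`, `kill()` and
`exitcode` works, such as a `multiprocessing.Process`.

```python
import time


def check_process_alive(process, name="checkpoint process"):
    """Raise instead of letting the caller block on a dead worker."""
    if not process.is_alive():
        raise RuntimeError(
            f"{name} exited unexpectedly (exitcode={process.exitcode})"
        )


def wait_for_event_with_timeout(event, process, timeout=300.0,
                                poll_interval=10.0):
    """Wait for `event`, checking the worker's health between short waits."""
    deadline = time.monotonic() + timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"event not set within {timeout} seconds")
        # Wait in short slices so a crashed worker is noticed quickly.
        if event.wait(timeout=min(poll_interval, remaining)):
            return
        check_process_alive(process)


def shutdown_process(process, join_timeout=5.0):
    """Escalate join() -> terminate() -> kill(); report which step worked."""
    process.join(timeout=join_timeout)
    if not process.is_alive():
        return "joined"
    process.terminate()  # polite request (SIGTERM on POSIX)
    process.join(timeout=join_timeout)
    if not process.is_alive():
        return "terminated"
    process.kill()  # last resort (SIGKILL on POSIX)
    process.join(timeout=join_timeout)
    return "killed"
```

Checking the event before the process state means a worker that sets the
event and then exits normally is not reported as a crash.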
## Test plan
- [ ] Verify normal checkpoint save/load still works
- [ ] Test behavior when checkpoint process is killed mid-save
- [ ] Verify timeout triggers after 5 minutes of no response
- [ ] Confirm graceful shutdown with Ctrl+C during checkpoint
- [ ] Test with `python -O` to verify validation still works
---------
Signed-off-by: Rakshit-gen <sisodiarakshit456@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>