DeepSpeed
Fix DecoupledCheckpointEngine deadlock and improve reliability
#7742
Merged

Rakshit-gen
Rakshit-gen Fix decoupled checkpoint deadlock
543a933f
Rakshit-gen requested a review from tjruwase 112 days ago
sfc-gh-truwase Merge branch 'master' into fix/decoupled-checkpoint-deadlock
966df3de
sfc-gh-truwase Merge branch 'master' into fix/decoupled-checkpoint-deadlock
3ed642fd
sfc-gh-truwase commented on 2025-12-22
Rakshit-gen Fix decoupled checkpoint deadlock
53b46b8d
sfc-gh-truwase commented on 2025-12-22
sfc-gh-truwase commented on 2025-12-22
Rakshit-gen Fixed - _check_process_alive() now only checks self.ckpt_process.is_a…
b33cfbdc
Rakshit-gen Added assert after line 189 and reverted the print condition back to …
af310d1b
sfc-gh-truwase approved these changes on 2025-12-22
sfc-gh-truwase enabled auto-merge (squash) 110 days ago
sfc-gh-truwase merged ba29fdca into master 110 days ago
