DeepSpeed
ZeRO3: Improve mismatch detection
#7525
Merged

sfc-gh-truwase
sfc-gh-truwase Detect list len mismatches
641d86fa
sfc-gh-truwase requested a review from stas00 123 days ago
sfc-gh-truwase requested a review from tjruwase 123 days ago
sfc-gh-truwase requested a review from tohtana 123 days ago
sfc-gh-truwase Revert
b0b6bd6c
sfc-gh-truwase Z3 sanity check option
cbf3d661
sfc-gh-truwase Revert
ceef8756
tohtana commented on 2025-08-29
tohtana commented on 2025-08-29
sfc-gh-truwase Minor tweaks
1a11c187
sfc-gh-truwase Improve error message format
ffdccf26
sfc-gh-truwase Improve error message format
498e69c2
stas00 approved these changes on 2025-08-29
sfc-gh-truwase Update deepspeed/runtime/zero/utils.py
d6b3b74d
sfc-gh-truwase Update deepspeed/runtime/engine.py
5aad5745
sfc-gh-truwase PR feedback
0b6145fc
sfc-gh-truwase Add list length
948f7775
sfc-gh-truwase Merge branch 'master' into sfc-gh-truwase/detect_z3_state_mismatch
05f1e970
sfc-gh-truwase merged eabb687a into master 121 days ago
sfc-gh-truwase deleted the sfc-gh-truwase/detect_z3_state_mismatch branch 121 days ago

