DeepSpeed
fix(superoffload) preserve multi-group updates with shared cpu buffers (#7905)
#7906
Merged

sfc-gh-truwase merged 15 commits into master from xinyu/issue7905
xylian86
xylian86 fix(superoffload): preserve param group mapping across ZeRO-3 subgroups
a26e9af4
xylian86 perf(superoffload): use shared cpu buffers for worker updates
f35fcc87
xylian86 requested a review from tjruwase 31 days ago
xylian86 requested a review from tohtana 31 days ago
xylian86 fix format issue
3d8ecb59
chatgpt-codex-connector commented on 2026-03-15
xylian86 fix(superoffload): Avoid replaying cleared gradients in superoffload …
2895cd1f
xylian86 fix(superoffload): fix llava training failures caused by non-contiguo…
169f4b32
xylian86 fix format issue
98e8f6d9
sfc-gh-truwase commented on 2026-03-16
sfc-gh-truwase commented on 2026-03-16
sfc-gh-truwase commented on 2026-03-16
xylian86 break the group size and the reduce size
6855b083
xylian86 requested a review from loadams 24 days ago
xylian86 marked this pull request as draft 24 days ago
xylian86 refactor(superoffload): rewrite gradient reduction
6ce2a69c
xylian86 fix(superoffload): rewrite partition_grads and fix gradient race cond…
fb986104
xylian86 clean
3d048998
xylian86 clean
17a3ac5a
xylian86 device validaton
05d7f53e
xylian86 Fix SuperOffload loss divergence: add gradient accumulation across mi…
00732eda
xylian86 marked this pull request as ready for review 24 days ago
xylian86 fix format issue
7da1d33c
chatgpt-codex-connector commented on 2026-03-22
sfc-gh-truwase Merge branch 'master' into xinyu/issue7905
e0e66a25
sfc-gh-truwase approved these changes on 2026-03-28
sfc-gh-truwase merged 729df6ca into master 17 days ago
sfc-gh-truwase deleted the xinyu/issue7905 branch 17 days ago
