fix(superoffload) preserve multi-group updates with shared cpu buffers (#7905) #7906
fix(superoffload): preserve param group mapping across ZeRO-3 subgroups
a26e9af4
perf(superoffload): use shared cpu buffers for worker updates
f35fcc87
fix format issue
3d8ecb59
fix(superoffload): Avoid replaying cleared gradients in superoffload …
2895cd1f
fix(superoffload): fix llava training failures caused by non-contiguo…
169f4b32
fix format issue
98e8f6d9
break the group size and the reduce size
6855b083
xylian86 marked this pull request as draft 24 days ago
refactor(superoffload): rewrite gradient reduction
6ce2a69c
fix(superoffload): rewrite partition_grads and fix gradient race cond…
fb986104
clean
3d048998
clean
17a3ac5a
device validation
05d7f53e
Fix SuperOffload loss divergence: add gradient accumulation across mi…
00732eda
xylian86 marked this pull request as ready for review 24 days ago
fix format issue
7da1d33c
Merge branch 'master' into xinyu/issue7905
e0e66a25