DeepSpeed
729df6ca - fix(superoffload) preserve multi-group updates with shared cpu buffers (#7905) (#7906)

Fixes [issue #7905](https://github.com/deepspeedai/DeepSpeed/issues/7905).

- Preserve optimizer param-group metadata across ZeRO-3 subgroup splitting so SuperOffload handles multiple optimizer groups correctly.
- Switch the CPU worker path to shared CPU parameter and gradient buffers, removing the need to send updated parameters back through the result queue.
- Make the GPU-to-CPU gradient copy asynchronous and submit CPU optimizer work only after the copy has completed.

The figures below compare per-iteration time and GPU memory usage against the non-offload baseline. The second figure presents a correctness check of the updated version.

<img width="977" height="364" alt="Per-iteration time and GPU memory usage vs. the non-offload baseline" src="https://github.com/user-attachments/assets/8fb2cf21-1a8c-47dd-9090-ec73acc5c9dc" />
<img width="3248" height="1748" alt="Correctness check of the updated version" src="https://github.com/user-attachments/assets/d8121d64-dfd9-478c-87ea-b41e98630a2a" />

---------

Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
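The first bullet can be illustrated with a small sketch. This is not DeepSpeed's actual code; the function name and dictionary layout are illustrative assumptions. The idea is that when a large optimizer param group is split into fixed-size subgroups, each subgroup must carry a copy of the parent group's metadata (`lr`, `weight_decay`, etc.), otherwise per-group settings are lost after the split:

```python
# Hypothetical sketch of param-group metadata preservation across
# subgroup splitting (names are illustrative, not DeepSpeed's API).

def split_param_groups(param_groups, subgroup_size):
    """Split each optimizer param group into subgroups of at most
    `subgroup_size` params, copying the group metadata onto every subgroup."""
    subgroups = []
    for group in param_groups:
        params = group["params"]
        # Everything except the params list is group metadata to preserve.
        meta = {k: v for k, v in group.items() if k != "params"}
        for start in range(0, len(params), subgroup_size):
            sub = dict(meta)  # each subgroup keeps its parent's lr, wd, ...
            sub["params"] = params[start:start + subgroup_size]
            subgroups.append(sub)
    return subgroups

groups = [
    {"params": [0, 1, 2, 3, 4], "lr": 0.1, "weight_decay": 0.0},
    {"params": [5, 6, 7], "lr": 0.01, "weight_decay": 0.1},
]
subs = split_param_groups(groups, subgroup_size=2)
print([(g["lr"], g["params"]) for g in subs])
# → [(0.1, [0, 1]), (0.1, [2, 3]), (0.1, [4]), (0.01, [5, 6]), (0.01, [7])]
```

Without the `meta` copy, every subgroup would fall back to a single default learning rate, which is exactly the multi-group breakage the commit addresses.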
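The second and third bullets combine into one pattern: the CPU optimizer worker updates parameters in place through shared buffers (so nothing needs to travel back through a result queue), and each piece of work waits on a completion signal standing in for the asynchronous GPU-to-CPU gradient copy. A minimal thread-based sketch of that pattern, with all names and the SGD update being illustrative assumptions rather than DeepSpeed's implementation:

```python
import threading
from queue import Queue

# Hypothetical sketch: a CPU optimizer worker that consumes work items and
# updates parameters in place in shared CPU buffers. The Event stands in for
# the completion of the async GPU-to-CPU gradient copy.

def cpu_optimizer_worker(work_queue: Queue, params, grads, lr=0.1):
    while True:
        item = work_queue.get()
        if item is None:        # shutdown sentinel
            break
        copy_done, start, end = item
        copy_done.wait()        # run only after the gradient copy has landed
        for i in range(start, end):
            params[i] -= lr * grads[i]   # in-place update in the shared buffer
        work_queue.task_done()
    work_queue.task_done()

params = [1.0, 2.0, 3.0, 4.0]   # shared CPU parameter buffer
grads = [0.0, 0.0, 0.0, 0.0]    # shared CPU gradient buffer

q = Queue()
worker = threading.Thread(
    target=cpu_optimizer_worker, args=(q, params, grads), daemon=True
)
worker.start()

# Simulate the async GPU-to-CPU copy, then submit the CPU optimizer work.
copy_done = threading.Event()
grads[:] = [10.0, 10.0, 10.0, 10.0]  # "copy" gradients into the shared buffer
copy_done.set()                       # copy is ready
q.put((copy_done, 0, 4))

q.put(None)
q.join()
worker.join()
print(params)  # → [0.0, 1.0, 2.0, 3.0]
```

Because the worker writes into the same buffer the main path reads, the updated parameters are visible immediately after `q.join()`; in the real PR this is what removes the round trip through the result queue.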