fix(superoffload): preserve multi-group updates with shared CPU buffers (#7905) (#7906)
Fixes [issue #7905](https://github.com/deepspeedai/DeepSpeed/issues/7905).
- Preserve optimizer param-group metadata across ZeRO-3 subgroup
splitting so SuperOffload handles multiple optimizer groups correctly.
- Switch the CPU worker path to shared CPU parameter and gradient
buffers, removing the need to send updated parameters back through the
result queue.
- Make the GPU-to-CPU gradient copy asynchronous and submit CPU
optimizer work only after the copy is ready.
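The first bullet can be illustrated with a small sketch. This is not DeepSpeed's actual implementation; `split_into_subgroups` and the fixed `subgroup_size` are hypothetical, and the point is only that each subgroup copies its parent group's hyperparameters (`lr`, `weight_decay`, etc.) instead of inheriting a single global set, so multi-group optimizers keep per-group settings after ZeRO-3 splitting:

```python
def split_into_subgroups(param_groups, subgroup_size):
    """Split each optimizer param group into fixed-size subgroups,
    copying the group's non-param metadata onto every subgroup.

    Hypothetical sketch: param_groups follows the torch.optim convention
    of dicts with a "params" list plus hyperparameter keys.
    """
    subgroups = []
    for group in param_groups:
        # Everything except the parameter list is group metadata.
        meta = {k: v for k, v in group.items() if k != "params"}
        params = group["params"]
        for i in range(0, len(params), subgroup_size):
            sub = dict(meta)  # preserve lr, weight_decay, betas, ...
            sub["params"] = params[i:i + subgroup_size]
            subgroups.append(sub)
    return subgroups
```

With this shape, a group with `lr=0.01` keeps that learning rate on all of its subgroups even when another group uses `lr=0.1`.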
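The shared-buffer and async-copy bullets can be sketched together. In this toy model (not DeepSpeed's API; the class and names are illustrative), the CUDA device-to-host copy and its completion event are simulated with a `threading.Event`: the CPU worker updates parameters in place in a shared buffer, and it starts only after the copy-done event fires, so no updated parameters need to travel back through a result queue:

```python
import queue
import threading

class SharedBufferWorker:
    """Toy CPU optimizer worker operating on shared buffers.

    Hypothetical sketch: grad_buffer / param_buffer stand in for shared
    pinned CPU tensors, and copy_done stands in for a CUDA event recorded
    after an async GPU-to-CPU gradient copy.
    """

    def __init__(self):
        self.tasks = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            item = self.tasks.get()
            if item is None:  # shutdown sentinel
                break
            grad_buffer, param_buffer, copy_done, lr = item
            # Don't touch the gradient buffer until the async copy lands.
            copy_done.wait()
            # Plain SGD update, in place: the caller sees the result
            # directly in the shared buffer, so no result queue is needed.
            for i, g in enumerate(grad_buffer):
                param_buffer[i] -= lr * g

    def submit(self, grad_buffer, param_buffer, copy_done, lr):
        self.tasks.put((grad_buffer, param_buffer, copy_done, lr))

    def shutdown(self):
        self.tasks.put(None)
        self.thread.join()
```

Submitting the task before the copy completes is safe here: the worker blocks on the event, mirroring how CPU optimizer work can be enqueued early but only execute once the gradient copy is ready.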
The figures below compare per-iteration time and GPU memory usage
against the non-offload baseline. The second figure shows a correctness
check of the updated version.
<img width="977" height="364" alt="image"
src="https://github.com/user-attachments/assets/8fb2cf21-1a8c-47dd-9090-ec73acc5c9dc"
/>
<img width="3248" height="1748" alt="image"
src="https://github.com/user-attachments/assets/d8121d64-dfd9-478c-87ea-b41e98630a2a"
/>
---------
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>