Add ZenFlow code for Stage 1 & 2 (#7391)
This PR adds ZenFlow, an importance-aware offloaded training framework
for DeepSpeed ZeRO. ZenFlow enables multi-step overlap between
computation and communication during offloaded training, improving GPU
utilization and reducing stalls.
Highlights:
- New ZenFlow optimizers (ZenFlowCPUAdam, ZenFlowSelectiveAdamW)
- ZenFlowZeroOptimizer for ZeRO Stage 1/2 integration
- Configurable via ZenFlowConfig, integrated with DeepSpeedZeroConfig
- Unit tests and documentation included
Note: This PR focuses on Stage 1 and 2 integration. Stage 3 support will
be introduced in a follow-up PR.
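As a rough illustration of the configuration hook listed in the highlights, a ZeRO Stage 2 config with a ZenFlow section might look like the sketch below. It assumes the ZenFlow options sit under a `zenflow` key inside `zero_optimization`, next to CPU optimizer offload. The option names `topk_ratio`, `update_interval`, and `overlap_step` and their values are illustrative assumptions, not confirmed by this message. Check ZenFlowConfig for the actual fields.

```python
# Sketch of a DeepSpeed config dict enabling ZenFlow with ZeRO Stage 2.
# ASSUMPTION: the "zenflow" key and every option inside it are illustrative
# placeholders; the real field names are defined by ZenFlowConfig.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 2,  # this PR covers Stage 1 and 2 only
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "zenflow": {
            "topk_ratio": 0.05,     # assumed: fraction of gradients treated as important
            "update_interval": 4,   # assumed: steps between offloaded CPU updates
            "overlap_step": True,   # assumed: overlap CPU updates with GPU compute
        },
    },
}
```

A dict like this would typically be passed as the `config` argument of `deepspeed.initialize`.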
---------
Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Guokai Ma <guokai.ma@gmail.com>