DeepSpeed
e6e8c137 - Fix moe cpu offload (#5220)

Commit
1 year ago
Fix moe cpu offload (#5220)

MoE parameter gradient norms do not need to be averaged when they are created on CPU, but only when training with a single data-parallel rank (1-DP). When the MoE parameters are data-parallel and CPU offload is enabled, this change moves the tensor back to the GPU to compute the average.

This PR addresses https://github.com/microsoft/DeepSpeed/issues/5203

---------

Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
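The behaviour described in the message can be illustrated with a minimal sketch. This is not the code from the PR: the function name `average_moe_grad_norm`, its parameters, and the choice to return the result on the tensor's original device are all assumptions made for illustration. The collective is passed in as a callable so the sketch runs without an initialized process group.

```python
from typing import Callable

import torch


def average_moe_grad_norm(
    norm: torch.Tensor,
    dp_world_size: int,
    all_reduce_sum: Callable[[torch.Tensor], None],
    compute_device: str,
) -> torch.Tensor:
    """Hypothetical helper: average a gradient norm across data-parallel ranks.

    `all_reduce_sum` is assumed to sum its argument in place across ranks.
    `compute_device` is the device the collective should run on (for example
    a GPU when the norm was created on CPU because of offload).
    """
    # With a single data-parallel rank there is nothing to average.
    if dp_world_size == 1:
        return norm

    original_device = norm.device
    # Copy the norm to the device used for communication; the caller's
    # tensor is left untouched.
    norm_on_device = norm.detach().clone().to(compute_device)
    # Sum across ranks, then divide to get the mean.
    all_reduce_sum(norm_on_device)
    norm_on_device /= dp_world_size
    # Assumption: hand the result back on the device it started on.
    return norm_on_device.to(original_device)
```

With `torch.distributed` initialized, `all_reduce_sum` could be something like `lambda t: torch.distributed.all_reduce(t, group=dp_group)`, where `dp_group` is a placeholder for the data-parallel process group of the MoE parameters; the default reduction there is a sum.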