Fix moe cpu offload (#5220)

Commit

1 year ago

MoE parameter gradient norms that are created on the CPU can skip averaging only when training with a data-parallel size of 1 (1-DP). When the MoE parameters are data-parallel and CPU offload is enabled, this change moves the tensor back to the GPU to compute the average.

This PR addresses https://github.com/microsoft/DeepSpeed/issues/5203

Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
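The idea described in the message can be illustrated with a minimal sketch. This is not the code from the PR: the function name `average_norm_across_group`, its parameters, and the choice to copy the result back to the original device are all assumptions made for illustration. It only uses standard `torch.distributed` calls and shows one way a norm that lives on the CPU could be averaged over a data-parallel group whose backend expects GPU tensors.

```python
import torch
import torch.distributed as dist


def average_norm_across_group(norm, group=None, compute_device=None):
    """Average a norm tensor over a data-parallel group (hypothetical helper).

    Illustrative only; the name and signature are not DeepSpeed's API.
    """
    # Single-process run: there is no group to average over.
    if not (dist.is_available() and dist.is_initialized()):
        return norm

    world_size = dist.get_world_size(group=group)
    # Data-parallel size of 1: the local norm is already the average.
    if world_size == 1:
        return norm

    origin_device = norm.device
    if compute_device is None:
        # Assumption: use the current GPU when one exists, else stay put.
        compute_device = (
            torch.device("cuda") if torch.cuda.is_available() else origin_device
        )

    # With CPU offload the norm may have been created on the CPU. Copy it to
    # the compute device so a GPU-only backend such as NCCL can reduce it.
    work = norm.detach().clone().to(compute_device, dtype=torch.float32)
    dist.all_reduce(work, op=dist.ReduceOp.SUM, group=group)
    work /= world_size

    # Assumption: hand the result back on the device the caller started with.
    return work.to(origin_device, dtype=norm.dtype)
```

Without an initialized process group the helper returns its input unchanged, which mirrors the 1-DP case in which no averaging is needed.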

References

#5220 - Fix moe cpu offload

Author

RezaYazdaniAminabadi

Parents

3e06a154

DeepSpeed e6e8c137 - Fix moe cpu offload (#5220)
