Fix MoE CPU offload (#5220)
The gradient norms of MoE parameters do not need to be averaged when
they are created on the CPU and training uses a single data-parallel
rank (1-DP). When the MoE parameters are data-parallel and CPU offload
is enabled, this change moves the norm tensor back to the GPU to compute
the average.
This PR addresses https://github.com/microsoft/DeepSpeed/issues/5203
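
The control flow described above can be sketched without any framework. This is an illustration only, not the code changed in this PR: `FakeTensor`, `average_moe_grad_norm`, and the device names `"cpu"` and `"gpu"` are hypothetical, the all-reduce is simulated by summing a list, and the idea that the collective needs the tensor on the accelerator is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a scalar tensor: a value plus a device tag."""
    value: float
    device: str

    def to(self, device: str) -> "FakeTensor":
        # Mimic a device move by returning a copy with a new device tag.
        return FakeTensor(self.value, device)


def average_moe_grad_norm(local_norms, offload_device="cpu", accel_device="gpu"):
    """Average per-rank MoE gradient norms; one FakeTensor per data-parallel rank."""
    dp_world_size = len(local_norms)
    if dp_world_size == 1:
        # 1-DP training: nothing to average, the norm stays where it was created.
        return list(local_norms)

    scaled = []
    for norm in local_norms:
        t = FakeTensor(norm.value / dp_world_size, norm.device)
        if t.device == offload_device:
            # Assumption: the collective runs on the accelerator, so a norm
            # created on the CPU is moved there first.
            t = t.to(accel_device)
        scaled.append(t)

    # Simulated sum all-reduce: every rank ends up with the same total.
    total = sum(t.value for t in scaled)

    # Return the averaged norm on each rank's original device.
    return [FakeTensor(total, norm.device) for norm in local_norms]


if __name__ == "__main__":
    ranks = [FakeTensor(2.0, "cpu"), FakeTensor(4.0, "cpu")]
    for t in average_moe_grad_norm(ranks):
        print(t.value, t.device)
```

With a single rank the function returns its input unchanged, which corresponds to the 1-DP case where no averaging is needed.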
---------
Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>