Fix the MoE-params gradient-scaling (#4957)
This PR fixes a bug that I introduced in a previous
[PR](https://github.com/microsoft/DeepSpeed/pull/4695). The MoE params'
gradients were accidentally double-scaled because
`self.ipg_bucket_has_moe_params` was passed to the all_reduce functions. Since
we have already scaled the MoE parameters
[here](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage_1_and_2.py#L1054),
we can safely pass `divide=False`. The `divide` argument may not be needed
anymore; however, I left it in place since I think it may still be needed for
the sequence-parallelism accuracy/stability adjustments.
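To illustrate the double-scaling, here is a minimal sketch, not the actual DeepSpeed code path: the helper name `allreduce_bucket`, the `expert_dp_group` handle, and the pre-scaling step are simplified assumptions. The point is that the MoE gradients are already divided once before the bucket is reduced, so letting the all-reduce helper divide again scales them twice.

```python
import torch
import torch.distributed as dist

def allreduce_bucket(bucket: torch.Tensor, group, divide: bool = True):
    # Hypothetical helper mirroring the all_reduce path: when divide=True,
    # it scales the bucket by the process-group size as part of averaging.
    if divide:
        bucket.div_(dist.get_world_size(group=group))
    dist.all_reduce(bucket, group=group)
    return bucket

# Before this fix: the MoE gradient bucket was already divided once during
# gradient accumulation, and divide=True divided it a second time.
# After this fix: pass divide=False so the pre-applied scaling is the only one,
# e.g. (assuming dist is initialized and the names below exist):
#   allreduce_bucket(moe_grad_bucket, group=expert_dp_group, divide=False)
```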
cc: @tjruwase