Changes default DDP behavior to divide sparse grad by world size before allreduce, not after (#61814)
Summary:
I appreciate https://github.com/pytorch/pytorch/pull/61379, which restores the fusion of div-by-world-size and copy-to-allreduce-buffer for dense gradients. But I noticed that in the wake of https://github.com/pytorch/pytorch/pull/61379 there's misaligned treatment of dense and sparse gradients: dense gradients are divided by world size before the allreduce, while sparse gradients are divided by world size after the allreduce. On paper you wouldn't expect that to matter, but for cluster-scale DDP training with amp gradient scaling and allreduces of FP16 grads, we've noticed several cases where postdividing grads by world size caused nonconvergence while predividing worked. I'm not aware of any cases where the reverse was true.
This PR changes the treatment of sparse gradients to match that of dense gradients: both are now divided by world size before the allreduce.
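The numerics behind the pre- vs post-divide distinction can be illustrated with a minimal sketch (not the DDP implementation; the world size and gradient values here are made up for illustration). Summing FP16 gradients across many ranks before dividing can overflow FP16's maximum finite value (~65504), producing inf, while predividing each rank's contribution keeps every intermediate value in range:

```python
import numpy as np

# Hypothetical standalone illustration of why predividing FP16 grads by
# world size can matter; simulates the allreduce sum as a serial loop.
world_size = 256
grads = [np.float16(512.0)] * world_size  # one gradient value per rank

# Post-divide: allreduce-sum first, then divide by world size.
# The running FP16 sum reaches 65536 > 65504 and overflows to inf.
post = np.float16(0.0)
for g in grads:
    post = np.float16(post + g)
post = np.float16(post / world_size)
print(post)  # -> inf

# Pre-divide: each rank divides its gradient first, then allreduce-sums.
# Each addend is 2.0 and every partial sum stays well inside FP16 range.
pre = np.float16(0.0)
for g in grads:
    pre = np.float16(pre + np.float16(g / world_size))
print(pre)  # -> 512.0
```

Mathematically the two orderings are equivalent, which is why the mismatch went unnoticed; the divergence only appears once finite precision and gradient scaling enter the picture.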
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61814
Reviewed By: mrshenli
Differential Revision: D29772444
Pulled By: rohan-varma
fbshipit-source-id: 033a17d5c019511889d908876282c6624fb26a2d