Changes default DDP behavior to divide sparse grad by world size before allreduce, not after (#61814)
Summary:
I appreciate https://github.com/pytorch/pytorch/pull/61379, which restores the fusion of div-by-world-size and copy-to-allreduce-buffer for dense gradients. But I noticed that in the wake of https://github.com/pytorch/pytorch/pull/61379 there's misaligned treatment of dense and sparse gradients: dense gradients are divided by world size before the allreduce, while sparse gradients are divided by world size after the allreduce. On paper you wouldn't expect that to matter, but for cluster-scale DDP training with amp gradient scaling and allreduces of FP16 grads, we've noticed several cases where postdividing grads by world size caused nonconvergence while predividing worked. I'm not aware of any cases where the reverse was true.
This PR changes the treatment of sparse gradients to match that of dense gradients: both are now divided by world size before the allreduce.
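The numerics behind the pre- vs post-divide distinction can be illustrated with a minimal sketch (not the DDP implementation; the world size and gradient values here are made up for illustration). Summing FP16 gradients across many ranks before dividing can overflow FP16's maximum finite value (~65504), producing inf, while predividing each rank's contribution keeps every intermediate value in range:

```python
import numpy as np

# Hypothetical standalone illustration of why predividing FP16 grads by
# world size can matter; simulates the allreduce sum as a serial loop.
world_size = 256
grads = [np.float16(512.0)] * world_size  # one gradient value per rank

# Post-divide: allreduce-sum first, then divide by world size.
# The running FP16 sum reaches 65536 > 65504 and overflows to inf.
post = np.float16(0.0)
for g in grads:
    post = np.float16(post + g)
post = np.float16(post / world_size)
print(post)  # -> inf

# Pre-divide: each rank divides its gradient first, then allreduce-sums.
# Each addend is 2.0 and every partial sum stays well inside FP16 range.
pre = np.float16(0.0)
for g in grads:
    pre = np.float16(pre + np.float16(g / world_size))
print(pre)  # -> 512.0
```

Mathematically the two orderings are equivalent, which is why the mismatch went unnoticed; the divergence only appears once finite precision and gradient scaling enter the picture.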
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61814
Reviewed By: mrshenli
Differential Revision: D29772444
Pulled By: rohan-varma
fbshipit-source-id: 033a17d5c019511889d908876282c6624fb26a2d