pytorch
3137bbeb - [Reland][DDP] Merge work and future_work in reducer (#59520)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59520

Remove the `work` attribute from the Reducer class in favor of `future_work`. Additionally, remove the `copy_grad_to_bucket` method, since it is now a one-line implementation, and add a new C++ comm hook called `_AllReduceCommHookWithDivFactor` to replace the built-in allreduce and to support handling uneven inputs.

Compared with the reverted https://github.com/pytorch/pytorch/pull/58937, this version updates `_AllReduceCommHookWithDivFactor` in `default_comm_hooks.cpp` to apply the division before the allreduce, and hence avoid FP16 overflow.

Original PR Issue: https://github.com/pytorch/pytorch/issues/41266

ghstack-source-id: 130685351

Test Plan:
buck test caffe2/test/distributed:distributed_gloo_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_grad_div_uneven_inputs
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_grad_is_view

Reviewed By: walterddr

Differential Revision: D28922305

fbshipit-source-id: 6388a96eda7a06f292873afed6d1362096c13e1c
Author: Yi Wang
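The FP16 fix this commit describes, dividing the gradient by the world size before the allreduce so the summed values never exceed the FP16 range, is the same divide-first pattern used by PyTorch's public Python DDP comm hooks. Below is a minimal sketch of that pattern using the public `register_comm_hook` API; the hook name `allreduce_with_div_hook` is illustrative and is not the `_AllReduceCommHookWithDivFactor` C++ hook added by this commit.

```python
import torch
import torch.distributed as dist

def allreduce_with_div_hook(process_group, bucket):
    """Divide-first allreduce: scaling by world size *before* the
    collective keeps FP16 gradient values small, avoiding the
    overflow that dividing after the sum can produce."""
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = group.size()

    # Pre-divide the flattened gradient bucket in place.
    tensor = bucket.buffer()
    tensor.div_(world_size)

    # Launch the async allreduce and return its Future; DDP installs
    # the resulting tensor back into the bucket when it completes.
    fut = dist.all_reduce(tensor, group=group, async_op=True).get_future()
    return fut.then(lambda f: f.value()[0])
```

A model wrapped in `DistributedDataParallel` would register it with `model.register_comm_hook(state=None, hook=allreduce_with_div_hook)`.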