[Reland][DDP] Merge work and future_work in reducer (#59520)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59520
Remove the `work` attribute from the `Reducer` class in favor of `future_work`.
Additionally, remove the `copy_grad_to_bucket` method, since it is now a one-line implementation, and create a new C++ comm hook, `_AllReduceCommHookWithDivFactor`, to replace the plain allreduce and also handle uneven inputs.
Compared with the reverted https://github.com/pytorch/pytorch/pull/58937, this reland updates `_AllReduceCommHookWithDivFactor` in `default_comm_hooks.cpp` to apply the division before the allreduce, which avoids FP16 overflow.
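For illustration, here is a minimal Python sketch of the division-first pattern (the actual hook in this diff is implemented in C++ in `default_comm_hooks.cpp`; the function name and registration below are hypothetical, not part of this change). The motivation: FP16's largest finite value is 65504, so summing raw gradients across many ranks can overflow, while dividing each rank's contribution by the div factor before the allreduce keeps the sum in range.

```python
import torch.distributed as dist

def allreduce_with_div_factor_hook(process_group, bucket):
    # Hypothetical Python analogue of _AllReduceCommHookWithDivFactor.
    group = process_group if process_group is not None else dist.group.WORLD
    div_factor = dist.get_world_size(group)
    # Divide BEFORE the allreduce: each rank's contribution is scaled down
    # first, so the FP16 sum stays below the ~65504 finite maximum.
    tensor = bucket.buffer().div_(div_factor)
    fut = dist.all_reduce(tensor, group=group, async_op=True).get_future()
    # A DDP comm hook must return a Future that resolves to the reduced tensor.
    return fut.then(lambda f: f.value()[0])
```

A hook like this would be registered via `model.register_comm_hook(state=None, hook=allreduce_with_div_factor_hook)`. For uneven inputs, the C++ hook presumably substitutes the effective number of participating ranks for the static world size as the div factor.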
#Original PR Issue: https://github.com/pytorch/pytorch/issues/41266
ghstack-source-id: 130685351
Test Plan:
buck test caffe2/test/distributed:distributed_gloo_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_grad_div_uneven_inputs
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_grad_is_view
Reviewed By: walterddr
Differential Revision: D28922305
fbshipit-source-id: 6388a96eda7a06f292873afed6d1362096c13e1c