[DDP Comm Hook] Re-enable the optimization of fusing copy and division when no comm hook is specified (#61379)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61379
The optimization was accidentally removed in https://github.com/pytorch/pytorch/pull/59574.
This optimization saves a scan over all the input parameters by fusing the copy and division operations when no comm hook is specified.
The default temporary hook is now an allreduce by sum, with no extra division performed inside the hook.
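For context, a minimal sketch of what a user-registered comm hook looks like (hypothetical user code against the public `register_comm_hook` API, assuming the `GradBucket.buffer()` accessor from recent PyTorch releases): once any hook is registered, DDP no longer applies the fused copy-and-divide path, so the hook itself is responsible for averaging.

```python
import torch
import torch.distributed as dist

def allreduce_mean_hook(
    process_group: dist.ProcessGroup, bucket: dist.GradBucket
) -> torch.futures.Future[torch.Tensor]:
    # With a hook registered, DDP does not divide gradients for us,
    # so divide by world size before the allreduce-by-sum.
    group = process_group if process_group is not None else dist.group.WORLD
    tensor = bucket.buffer().div_(group.size())
    fut = dist.all_reduce(tensor, group=group, async_op=True).get_future()
    # The future's value is a list of tensors; return the reduced buffer.
    return fut.then(lambda f: f.value()[0])

# Registration on a DDP-wrapped model (hypothetical usage):
# model.register_comm_hook(state=None, hook=allreduce_mean_hook)
```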
ghstack-source-id: 133288529
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_grad_div_uneven_inputs
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_non_default_stream
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_sparse_gradient
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_checkpointing_once
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_checkpointing_twice
Reviewed By: rohan-varma
Differential Revision: D29597614
fbshipit-source-id: 2434e4fd4e6abad7871cfe47886fe97b6e4ba28f