Resubmit: [Gradient Compression] Implement the original layerwise PowerSGD (#49639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49639
Resubmit #49417 with a fix for distributed_test.
The previous submission broke a multi-GPU test that runs on 4 GPUs. Since this test only runs on master, the breakage could not be detected before submission.
The real diff is:
https://github.com/pytorch/pytorch/pull/49639/commits/4ca1014bb533b17b956a24d35507037196c64281
This time I have verified that the previously failed test `pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test` passes, by creating a PR (#49651) from a separate branch:
https://app.circleci.com/pipelines/github/pytorch/pytorch/253644/workflows/c1c02b70-0877-40e6-8b4c-61f60f6b70ed/jobs/9768079
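For context, a minimal sketch of how the layerwise PowerSGD comm hook this PR implements gets registered on a DDP model; the module path and parameter names follow the eventual public PyTorch API and may have differed slightly at the time of this commit:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

# Assumes the default process group has already been initialized,
# e.g. via dist.init_process_group("nccl", ...).
model = DDP(torch.nn.Linear(1024, 1024).cuda(), device_ids=[dist.get_rank()])

# matrix_approximation_rank controls the rank of the low-rank
# approximation used to compress each gradient tensor layer by layer.
state = powerSGD.PowerSGDState(process_group=None, matrix_approximation_rank=1)
model.register_comm_hook(state, powerSGD.powerSGD_hook)
```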
ghstack-source-id: 118969912
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
Reviewed By: mrshenli
Differential Revision: D25654961
fbshipit-source-id: 2a45c8ceb9bdb54ff7309a8b66ec87e913e0150e