[Gradient Compression] Warm-start of PowerSGD (#49451)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49451
Reuse the low-rank tensors P(s) and Q(s) from the previous iteration whenever possible. Warm-starting the power iteration this way can improve compression in both accuracy (the iteration starts closer to the top singular subspace) and speed (no need to re-initialize the tensors each step).
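The warm-start idea can be sketched as a single rank-r power-iteration step that carries Q across calls. This is an illustrative toy, not the actual hook implementation; the function name `powersgd_step` and its signature are hypothetical.

```python
import torch

def powersgd_step(M, Q, warm_start=True):
    """One PowerSGD-style compression step on matrix M (n x m) with Q (m x r).

    Hypothetical helper for illustration only. With warm_start=True, the Q
    from the previous iteration is reused, so the power iteration resumes
    near the subspace it already found; otherwise Q is re-randomized.
    """
    if not warm_start:
        Q = torch.randn_like(Q)
    P = M @ Q                      # n x r
    P, _ = torch.linalg.qr(P)     # orthogonalize P
    Q = M.t() @ P                  # m x r, refined Q carried to the next step
    return P @ Q.t(), Q            # rank-r approximation of M, warmed Q
```

Calling this repeatedly on the same (or slowly changing) gradient matrix while reusing the returned Q tends to shrink the approximation error, which is the effect the warm start exploits across training iterations.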
Also add a unit test for batched PowerSGD to test_c10d.py.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 119014132
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
Reviewed By: rohan-varma
Differential Revision: D25583086
fbshipit-source-id: a757df3c4cfcc0ead4647f7de2f43198f1e063ee