[Gradient Compression] PowerSGD comm hook (#48060)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48060
Implement a PowerSGD variant that operates on a batched, flattened tensor with zero padding.
This version does not need to handle 1D tensors and multi-dimensional tensors in the input separately, and hence does not need to create two asynchronous future chains.
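The core idea, in a minimal sketch (illustrative only, not the PR's implementation; helper names like `powersgd_compress_decompress` and the `matrix_rows` parameter are made up here):

```python
import torch

def powersgd_compress_decompress(grads, matrix_rows=128, rank=1):
    # Flatten all gradients in the bucket into one 1D tensor and zero-pad it
    # so it reshapes into a single 2D matrix of shape (matrix_rows, cols).
    flat = torch.cat([g.reshape(-1) for g in grads])
    cols = -(-flat.numel() // matrix_rows)  # ceil division
    padded = torch.zeros(matrix_rows * cols, device=flat.device, dtype=flat.dtype)
    padded[: flat.numel()] = flat
    M = padded.view(matrix_rows, cols)

    # Low-rank projection: P = M Q, orthogonalize P, then Q = M^T P.
    # In DDP, P and Q would each be all-reduced; because the whole bucket is
    # one matrix, a single asynchronous future chain suffices.
    Q = torch.randn(cols, rank, device=M.device, dtype=M.dtype)
    P = M @ Q
    P, _ = torch.linalg.qr(P)  # orthonormalize the columns of P
    Q = M.t() @ P

    # Decompress: rank-`rank` approximation, then strip padding and unflatten.
    approx_flat = (P @ Q.t()).reshape(-1)[: flat.numel()]
    out, offset = [], 0
    for g in grads:
        out.append(approx_flat[offset : offset + g.numel()].view_as(g))
        offset += g.numel()
    return out
```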
Potential optimizations:
1) Consider FP16 compression throughout PowerSGD.
2) Warm start and save one matrix multiplication per iteration.
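For context, a hook like this would be registered on the DDP model roughly as follows (a sketch assuming the later public API; the module path `torch.distributed.algorithms.ddp_comm_hooks.powerSGD_hook`, the `batched_powerSGD_hook` name, and the `PowerSGDState` arguments may differ from this PR's internal code):

```python
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

# Assumes `ddp_model` is a torch.nn.parallel.DistributedDataParallel instance
# and the default process group has already been initialized.
state = powerSGD.PowerSGDState(process_group=None, matrix_approximation_rank=1)
ddp_model.register_comm_hook(state, powerSGD.batched_powerSGD_hook)
```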
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117105938
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
Reviewed By: jiayisuse
Differential Revision: D24843692
fbshipit-source-id: f44200b1fd6e12e829fc543d21ab7ae086769561