[Gradient Compression] Allow BatchedPowerSGD to run vanilla allreduce for the first K iterations (#51270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51270
Similar to #50973, allow the batched version of PowerSGD to run vanilla allreduce for the first K iterations before compression starts. This uncompressed warm-up can reduce the accuracy loss from compression early in training, which may make the batched version applicable to use cases where the accuracy requirement is not very strict.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
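
For context, a rough sketch of how K can be configured through the DDP communication hook API; `start_powerSGD_iter` in `PowerSGDState` corresponds to K here, the helper name `setup_batched_powersgd` is made up for illustration, and exact signatures at the time of this PR may differ:

```python
# A minimal sketch, not the exact code in this PR. Assumes the default
# process group has already been initialized via
# torch.distributed.init_process_group.
import torch
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_batched_powersgd(model: torch.nn.Module, warmup_iters: int = 1000) -> DDP:
    ddp_model = DDP(model)
    # start_powerSGD_iter plays the role of K: vanilla allreduce runs for the
    # first `warmup_iters` iterations, and batched PowerSGD compression only
    # kicks in afterwards.
    state = powerSGD.PowerSGDState(
        process_group=None,  # None means the default process group
        start_powerSGD_iter=warmup_iters,
    )
    ddp_model.register_comm_hook(state, powerSGD.batched_powerSGD_hook)
    return ddp_model
```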
ghstack-source-id: 120725858
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
baseline: f248001754
batched PowerSGD: f246960752
The training time was reduced from 54m48s to 30m33s, and the accuracy was approximately the same: 44.21 vs. 44.35.
Reviewed By: rohan-varma
Differential Revision: D26077709
fbshipit-source-id: 6afeefad7a3fbdd7da2cbffb56dfbad855a96cb5