[Gradient Compression] Add a random generator to PowerSGD state for initializing low-rank matrix Q (#48507)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48507
Previously, the random seed was the length of the input tensor, which is not guaranteed to differ between batches. This change initializes a random generator in the PowerSGD state and uses that generator to draw a new random seed for randomizing the low-rank tensor Q at every step.
Therefore, the initial tensor Q should be the same across all the replicas at the same step, but different at different steps.
'torch.manual_seed' is used in the same way as https://github.com/epfml/powersgd/blob/master/gradient_reducers.py#L675
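The seeding scheme described above can be sketched as follows. This is a minimal illustration, not the code in this PR: it uses Python's stdlib `random` module in place of `torch.manual_seed` and `torch.randn` so that it runs without PyTorch, and the names `PowerSGDStateSketch` and `init_low_rank_q` are hypothetical. It assumes every replica constructs its state with the same initial seed.

```python
import random


class PowerSGDStateSketch:
    """Hypothetical stand-in for the hook state: owns one persistent generator."""

    def __init__(self, random_seed=0):
        # Assumption: all replicas pass the same random_seed here.
        self.rng = random.Random(random_seed)


def init_low_rank_q(state, rows, rank):
    """Draw a per-step seed from the state, then fill Q from that seed."""
    # A fresh seed at every call, so Q changes from step to step.
    step_seed = state.rng.randrange(1_000_000_000)
    # Plays the role of torch.manual_seed(step_seed) followed by a randn fill.
    step_rng = random.Random(step_seed)
    q = [[step_rng.gauss(0.0, 1.0) for _ in range(rank)] for _ in range(rows)]
    return step_seed, q


# Two simulated replicas that start from the same state seed.
rank0 = PowerSGDStateSketch(random_seed=0)
rank1 = PowerSGDStateSketch(random_seed=0)

steps_rank0 = [init_low_rank_q(rank0, rows=4, rank=2) for _ in range(3)]
steps_rank1 = [init_low_rank_q(rank1, rows=4, rank=2) for _ in range(3)]
```

In this sketch both replicas see the same seed and the same Q at each step because their generators advance in lockstep, while consecutive steps draw different seeds. The old input-length seed could repeat whenever two batches had the same size.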
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117483639
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl_grad_is_view
Also checked the initial Qs and the seeds passed to torch.manual_seed() on different ranks for a few steps in real runs.
Example logs:
On two nodes, different ranks use exactly the same random seed at the same step, and the random seed varies from step to step.
{F346971916}
Reviewed By: rohan-varma
Differential Revision: D25191589
fbshipit-source-id: f7f17df3ad2075ecae1a2a56ca082160f7c5fcfc