Fix Pipe + DDP for unused parameters, static graph (#60118)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60118
Pipe + DDP has two issues:
1) With static graph, gradients are not synchronized on the first backward pass (i.e., the delayed allreduce is not run). Broken since https://github.com/pytorch/pytorch/pull/55248
2) With find_unused_parameters=True, gradient synchronization also does not occur. Broken since https://github.com/pytorch/pytorch/pull/57081. A repro sketch for both cases follows.
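For reference, a hedged repro sketch of the failing configuration (model, shapes, and launch details are illustrative, not the exact test code; assumes one process per rank with two local GPUs and the usual MASTER_ADDR/MASTER_PORT env vars):

```python
import torch
import torch.distributed as dist
import torch.distributed.rpc as rpc
import torch.nn as nn
from torch.distributed.pipeline.sync import Pipe
from torch.nn.parallel import DistributedDataParallel

dist.init_process_group("nccl")
rank = dist.get_rank()
rpc.init_rpc(f"worker{rank}", rank=rank, world_size=dist.get_world_size())

# Two-stage pipeline across two local GPUs, wrapped in DDP.
model = nn.Sequential(nn.Linear(16, 16).cuda(0), nn.Linear(16, 16).cuda(1))
pipe = Pipe(model, chunks=2)
ddp = DistributedDataParallel(pipe, find_unused_parameters=True)  # issue 2
# ddp._set_static_graph()  # issue 1: static graph

out_rref = ddp(torch.randn(8, 16).cuda(0))  # Pipe's forward returns an RRef
out_rref.local_value().sum().backward()     # grads are NOT allreduced here
```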
The root cause in both cases is that calling `DDPSink.apply(output_tensor)` does not invoke the custom `backward` of `DDPSink` when `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP with `Pipe`. This is because `backward` is later run on `rref.local_value()`, whose autograd graph never recorded `DDPSink`.
To fix this, we unwrap the RRef and reconstruct it as needed, similar to the fix in https://github.com/pytorch/pytorch/pull/49908. The sketch below illustrates the idea.
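A minimal, self-contained sketch of the mechanism, using hypothetical stand-ins (`Sink` for `DDPSink`, `Box` for an `OwnerRRef`) rather than the actual DDP internals:

```python
import torch

class Sink(torch.autograd.Function):
    """Stand-in for DDPSink: backward is where DDP triggers the delayed
    allreduce / unused-parameter handling."""

    @staticmethod
    def forward(ctx, inp):
        return inp

    @staticmethod
    def backward(ctx, grad):
        print("Sink.backward ran (gradient sync would happen here)")
        return grad

class Box:
    """Stand-in for an OwnerRRef wrapping the Pipe output tensor."""

    def __init__(self, value):
        self._value = value

    def local_value(self):
        return self._value

# Broken: applying the sink to the wrapper records nothing in the inner
# tensor's autograd graph, so backward() on local_value() never reaches
# Sink.backward.
x = torch.ones(2, requires_grad=True)
broken = Sink.apply(Box(x * 2))
broken.local_value().sum().backward()  # nothing printed

# Fixed (simplified): unwrap, apply the sink to the local tensor, and
# reconstruct the wrapper so the sink sits in the graph that backward()
# actually traverses.
y = torch.ones(2, requires_grad=True)
fixed = Box(Sink.apply(Box(y * 2).local_value()))
fixed.local_value().sum().backward()   # prints the message above
```

Per the summary above, the actual change performs the analogous unwrap/apply/rewrap at the point where DDP applies `DDPSink` to the module output.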
To test:
All tests in pipe_with_ddp_test pass.
The reason these tests did not catch the errors earlier is that all ranks received the same model inputs, so even without gradient synchronization the grads stayed identical across ranks: the replicas start from the same model (guaranteed by DDP) and therefore compute the same local gradients. Fixed the tests to use different inputs across ranks, as sketched below.
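A hedged sketch of that test change, using hypothetical helpers (`make_input`, `assert_grads_synced`) rather than the exact pipe_with_ddp_test code:

```python
import torch
import torch.distributed as dist

def make_input(rank: int, batch_size: int = 8, dim: int = 16) -> torch.Tensor:
    # Seed per rank so each rank feeds the model a different batch.
    g = torch.Generator().manual_seed(rank)
    return torch.randn(batch_size, dim, generator=g)

def assert_grads_synced(model: torch.nn.Module) -> None:
    # If DDP's allreduce ran, the averaged grads are identical on every
    # rank; with per-rank inputs, this fails when synchronization is
    # skipped.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is None:
            continue
        gathered = [torch.empty_like(p.grad) for _ in range(world_size)]
        dist.all_gather(gathered, p.grad)
        for g in gathered:
            assert torch.allclose(g, gathered[0])
```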
ghstack-source-id: 131688187
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D29167283
fbshipit-source-id: fe62310db2dc6de8519eb361b1df8ae4dfce3ab8