DDP Communication hook: Fix the way we pass future result to buckets. (#43307)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43307
I identified a bug in the DDP communication hook while running accuracy benchmarks: I was getting `loss=nan`.
It turned out that after we re-`initialize_bucketviews` with the value of `future_work`, the `bucket_view.copy_(grad)` in `Reducer::mark_variable_ready_dense` no longer copied the `grads` back into `contents`, because `bucket_view` lost its relationship with `contents` once it was re-initialized with something else. Over multiple iterations, this caused the gradients to diverge.
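The failure mode is ordinary view aliasing. A minimal sketch (numpy standing in for ATen tensors; variable names are illustrative, not the actual Reducer fields) of how rebinding a view severs its link to the base buffer:

```python
import numpy as np

# 'contents' is the flat bucket buffer; 'bucket_view' aliases a slice of it.
contents = np.zeros(4)
bucket_view = contents[:2]  # a view into contents

grad = np.array([1.0, 2.0])
bucket_view[:] = grad       # copy through the view: contents sees the update
assert contents[0] == 1.0

# Rebinding the view to an unrelated tensor (e.g. the hook's future result)
# severs the aliasing; later copies no longer reach 'contents'.
bucket_view = np.array([5.0, 6.0])
grad = np.array([3.0, 4.0])
bucket_view[:] = grad
assert contents[0] == 1.0   # contents was NOT updated this time
```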
I fixed this by splitting `bucket_view` into two sets of views:
```
// bucket_views_in[i].copy_(grad) and
// grad.copy_(bucket_views_out[i])
// provide convenient ways to move grad data in/out of contents.
std::vector<at::Tensor> bucket_views_in;
std::vector<at::Tensor> bucket_views_out;
```
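With this split, the in-views stay permanently aliased to `contents` so writing gradients in always works, while the out-views are the ones that may be re-pointed at the hook's future result for reading gradients back out. A rough numpy sketch of the scheme (names hypothetical, not the actual Reducer code):

```python
import numpy as np

contents = np.zeros(4)
# views_in always alias 'contents', so grad.copy-in keeps working
# across iterations...
views_in = [contents[:2], contents[2:]]
# ...while views_out start as the same aliases but may later be
# re-pointed at the hook's output.
views_out = [contents[:2], contents[2:]]

grad = np.array([1.0, 2.0])
views_in[0][:] = grad                 # bucket_views_in[i].copy_(grad)
assert contents[0] == 1.0             # contents still receives the grad

hook_result = contents * 0.5          # stand-in for the future_work value
views_out = [hook_result[:2], hook_result[2:]]

out = np.empty(2)
out[:] = views_out[0]                 # grad.copy_(bucket_views_out[i])
assert out[0] == 0.5                  # reduced value read back out
```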
I included two additional unit tests where we run multiple iterations for better test coverage:
1) `test_accumulate_gradients_no_sync_allreduce_hook`
2) `test_accumulate_gradients_no_sync_allreduce_with_then_hook`.
ghstack-source-id: 110728299
Test Plan:
Ran `python test/distributed/test_c10d.py` and some perf & accuracy benchmarks.
New tests:
`test_accumulate_gradients_no_sync_allreduce_hook`
`test_accumulate_gradients_no_sync_allreduce_with_then_hook`
Accuracy benchmark results look okay:
f214188350
Reviewed By: agolynski
Differential Revision: D23229309
fbshipit-source-id: 329470036cbc05ac12049055828495fdb548a082