DDP Communication hook: Fix the way we pass future result to buckets. (#43307)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43307
I identified a bug in the DDP communication hook while running accuracy benchmarks: I was getting `loss=nan`.
It turned out that after we re-`initialize_bucketviews` with the value of `future_work`, the `bucket_view.copy_(grad)` in `Reducer::mark_variable_ready_dense` no longer copied the `grads` back into `contents`, because `bucket_view` lost its relationship with `contents` once it was re-initialized with something else. Over multiple iterations, this caused the gradients to diverge.
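The failure mode is ordinary view aliasing. A minimal sketch (numpy standing in for ATen tensors; variable names are illustrative, not the actual Reducer fields) of how rebinding a view severs its link to the base buffer:

```python
import numpy as np

# 'contents' is the flat bucket buffer; 'bucket_view' aliases a slice of it.
contents = np.zeros(4)
bucket_view = contents[:2]  # a view into contents

grad = np.array([1.0, 2.0])
bucket_view[:] = grad       # copy through the view: contents sees the update
assert contents[0] == 1.0

# Rebinding the view to an unrelated tensor (e.g. the hook's future result)
# severs the aliasing; later copies no longer reach 'contents'.
bucket_view = np.array([5.0, 6.0])
grad = np.array([3.0, 4.0])
bucket_view[:] = grad
assert contents[0] == 1.0   # contents was NOT updated this time
```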
I fixed this by splitting `bucket_view` into two sets of views:
```
// bucket_views_in[i].copy_(grad) and
// grad.copy_(bucket_views_out[i])
// provide convenient ways to move grad data in/out of contents.
std::vector<at::Tensor> bucket_views_in;
std::vector<at::Tensor> bucket_views_out;
```
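With this split, the in-views stay permanently aliased to `contents` so writing gradients in always works, while the out-views are the ones that may be re-pointed at the hook's future result for reading gradients back out. A rough numpy sketch of the scheme (names hypothetical, not the actual Reducer code):

```python
import numpy as np

contents = np.zeros(4)
# views_in always alias 'contents', so grad.copy-in keeps working
# across iterations...
views_in = [contents[:2], contents[2:]]
# ...while views_out start as the same aliases but may later be
# re-pointed at the hook's output.
views_out = [contents[:2], contents[2:]]

grad = np.array([1.0, 2.0])
views_in[0][:] = grad                 # bucket_views_in[i].copy_(grad)
assert contents[0] == 1.0             # contents still receives the grad

hook_result = contents * 0.5          # stand-in for the future_work value
views_out = [hook_result[:2], hook_result[2:]]

out = np.empty(2)
out[:] = views_out[0]                 # grad.copy_(bucket_views_out[i])
assert out[0] == 0.5                  # reduced value read back out
```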
I included two additional unit tests where we run multiple iterations for better test coverage:
1) `test_accumulate_gradients_no_sync_allreduce_hook`
2) `test_accumulate_gradients_no_sync_allreduce_with_then_hook`.
ghstack-source-id: 110728299
Test Plan:
Ran `python test/distributed/test_c10d.py` and some perf & accuracy benchmarks.
New tests:
`test_accumulate_gradients_no_sync_allreduce_hook`
`test_accumulate_gradients_no_sync_allreduce_with_then_hook`
Accuracy benchmark results look okay:
f214188350
Reviewed By: agolynski
Differential Revision: D23229309
fbshipit-source-id: 329470036cbc05ac12049055828495fdb548a082