[PyTorch] Make DDP reducer work under distributed autograd (#37998)
Summary:
## Why doesn’t DDP work under dist_autograd?
DDP follows the steps below (a toy rendition of steps 2–5 appears after the list):
1. [DDP Python constructor](https://github.com/pytorch/pytorch/blob/8d6a8d2b3fd2a6ec788378843fc518824acf274b/torch/nn/parallel/distributed.py#L389-L393) (on a module) creates a [C++ Reducer](https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/reducer.cpp), which holds references to all parameters (or variables in C++ code).
2. The reducer installs a post hook on each model parameter.
3. The backward run starts and triggers the post hooks installed above.
4. The post hook of a parameter simply marks the parameter ready for all-reduce.
5. Once all parameters in a bucket are ready, an all-reduce kicks off for that bucket: it reads each variable's `.grad` and writes the reduced result back into `.grad`.
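For concreteness, here is a hypothetical single-process Python rendition of steps 2–5. The real Reducer is C++ with many buckets and asynchronous all-reduce, and it hooks the gradient accumulator node rather than the parameter itself; `ToyReducer`, its single bucket, and the use of `register_post_accumulate_grad_hook` (a newer convenience API, PyTorch >= 2.1) are illustrative stand-ins, and a default process group must already be initialized:
```
import torch
import torch.distributed as dist

class ToyReducer:
    """Hypothetical single-bucket sketch of steps 2-5 above."""

    def __init__(self, params):
        self.params = list(params)
        self.ready = set()
        for p in self.params:
            # Step 2: install a hook that fires once p.grad is accumulated.
            p.register_post_accumulate_grad_hook(self._mark_ready)

    def _mark_ready(self, param):
        # Step 4: mark this parameter ready for all-reduce.
        self.ready.add(param)
        if len(self.ready) == len(self.params):
            # Step 5: the whole "bucket" is ready -- read each .grad,
            # all-reduce it, and write the averaged result back to .grad.
            for p in self.params:
                dist.all_reduce(p.grad)
                p.grad /= dist.get_world_size()
            self.ready.clear()
```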
But under dist_autograd, a variable's `.grad` is never populated. Gradients instead live inside the distributed autograd context, in a per-context map from variables to their gradients.
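A minimal sketch of the contrast, using the public `torch.distributed.autograd` API (this assumes `torch.distributed.rpc.init_rpc(...)` has already been called, since distributed autograd contexts require the RPC framework):
```
import torch
import torch.distributed.autograd as dist_autograd

model = torch.nn.Linear(4, 1)

with dist_autograd.context() as context_id:
    loss = model(torch.randn(2, 4)).sum()
    # Unlike loss.backward(), this leaves param.grad untouched ...
    dist_autograd.backward(context_id, [loss])
    # ... and records gradients in a map owned by this context.
    grads = dist_autograd.get_gradients(context_id)  # Dict[Tensor, Tensor]
    for param in model.parameters():
        assert param.grad is None and param in grads
```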
## Solution of this PR
This PR has the distributed engine set a thread_local variable during a backward run, indicating that it is running in distributed mode. The DDP reducer then consults this thread local to decide whether to read and write `.grad` or the gradient map in the distributed context. More precisely, the thread local is set before the post hooks installed by the DDP reducer are invoked, so the hooks can observe it.
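Put together, this is what the PR enables; a hedged end-to-end sketch on a single worker (the backend, addresses, ports, and worker name are placeholder values, and the split rendezvous ports are just one way to initialize both the process group and RPC in one process):
```
import os
import torch
import torch.distributed as dist
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
from torch.nn.parallel import DistributedDataParallel as DDP

# Placeholder rendezvous settings for a one-process run.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)
rpc.init_rpc(
    "worker0", rank=0, world_size=1,
    rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
        init_method="tcp://localhost:29501"  # separate port from the process group
    ),
)

model = DDP(torch.nn.Linear(4, 1))

with dist_autograd.context() as context_id:
    loss = model(torch.randn(2, 4)).sum()
    # The distributed engine sets the thread_local before running the
    # reducer's post hooks, so the all-reduced gradients land in this
    # context's gradient map rather than in param.grad.
    dist_autograd.backward(context_id, [loss])
    grads = dist_autograd.get_gradients(context_id)

rpc.shutdown()
dist.destroy_process_group()
```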
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37998
Test Plan:
```
python test/distributed/test_ddp_under_dist_autograd.py
```
FB repo
```
buck test caffe2/test/distributed/...
```
DDP accuracy benchmark workflow run
```
flow-cli canary pytorch.benchmark.accuracy_comparison.workflow --parameters-json '{"node_world_size": 4, "dist_backend": "nccl"}' --run-as-secure-group fblearner_flow --entitlement gpu_prod
```
f196173157
Reviewed By: pritamdamania87
Differential Revision: D21513795
Pulled By: hczhu
fbshipit-source-id: fe21e68ecdc9274182db4d4bb5a1e2d68ef927a2