[PyTorch] Make DDP reducer work under distributed autograd (#37998)
Summary:
## Why doesn’t DDP work under dist_autograd?
DDP follows the steps below (a toy rendition of steps 2–5 appears after the list):
1. [DDP Python constructor](https://github.com/pytorch/pytorch/blob/8d6a8d2b3fd2a6ec788378843fc518824acf274b/torch/nn/parallel/distributed.py#L389-L393) (on a module) creates a [C++ Reducer](https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/reducer.cpp), which holds references to all parameters (or variables in C++ code).
2. The reducer installs a post hook on each model parameter.
3. The backward run starts and triggers the post hooks installed above.
4. The post hook of a parameter simply marks the parameter ready for all-reduce.
5. Once all parameters in a bucket are ready, an all-reduce kicks off for that bucket: it reads each variable's `.grad` and writes the reduced result back into `.grad`.
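For concreteness, here is a hypothetical single-process Python rendition of steps 2–5. The real Reducer is C++ with many buckets and asynchronous all-reduce, and it hooks the gradient accumulator node rather than the parameter itself; `ToyReducer`, its single bucket, and the use of `register_post_accumulate_grad_hook` (a newer convenience API, PyTorch >= 2.1) are illustrative stand-ins, and a default process group must already be initialized:
```
import torch
import torch.distributed as dist

class ToyReducer:
    """Hypothetical single-bucket sketch of steps 2-5 above."""

    def __init__(self, params):
        self.params = list(params)
        self.ready = set()
        for p in self.params:
            # Step 2: install a hook that fires once p.grad is accumulated.
            p.register_post_accumulate_grad_hook(self._mark_ready)

    def _mark_ready(self, param):
        # Step 4: mark this parameter ready for all-reduce.
        self.ready.add(param)
        if len(self.ready) == len(self.params):
            # Step 5: the whole "bucket" is ready -- read each .grad,
            # all-reduce it, and write the averaged result back to .grad.
            for p in self.params:
                dist.all_reduce(p.grad)
                p.grad /= dist.get_world_size()
            self.ready.clear()
```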
But under dist_autograd, a variable's `.grad` is never populated. Gradients instead live inside the distributed autograd context, in a per-context map from variables to their gradients.
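A minimal sketch of the contrast, using the public `torch.distributed.autograd` API (this assumes `torch.distributed.rpc.init_rpc(...)` has already been called, since distributed autograd contexts require the RPC framework):
```
import torch
import torch.distributed.autograd as dist_autograd

model = torch.nn.Linear(4, 1)

with dist_autograd.context() as context_id:
    loss = model(torch.randn(2, 4)).sum()
    # Unlike loss.backward(), this leaves param.grad untouched ...
    dist_autograd.backward(context_id, [loss])
    # ... and records gradients in a map owned by this context.
    grads = dist_autograd.get_gradients(context_id)  # Dict[Tensor, Tensor]
    for param in model.parameters():
        assert param.grad is None and param in grads
```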
## Solution of this PR
This PR has the distributed engine set a thread_local variable during a backward run, indicating that it is running in distributed mode. The DDP reducer then consults this thread local to decide whether to read and write `.grad` or the gradient map in the distributed context. More precisely, the thread local is set before the post hooks installed by the DDP reducer are invoked, so the hooks can observe it.
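Put together, this is what the PR enables; a hedged end-to-end sketch on a single worker (the backend, addresses, ports, and worker name are placeholder values, and the split rendezvous ports are just one way to initialize both the process group and RPC in one process):
```
import os
import torch
import torch.distributed as dist
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
from torch.nn.parallel import DistributedDataParallel as DDP

# Placeholder rendezvous settings for a one-process run.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)
rpc.init_rpc(
    "worker0", rank=0, world_size=1,
    rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
        init_method="tcp://localhost:29501"  # separate port from the process group
    ),
)

model = DDP(torch.nn.Linear(4, 1))

with dist_autograd.context() as context_id:
    loss = model(torch.randn(2, 4)).sum()
    # The distributed engine sets the thread_local before running the
    # reducer's post hooks, so the all-reduced gradients land in this
    # context's gradient map rather than in param.grad.
    dist_autograd.backward(context_id, [loss])
    grads = dist_autograd.get_gradients(context_id)

rpc.shutdown()
dist.destroy_process_group()
```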
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37998
Test Plan:
```
python test/distributed/test_ddp_under_dist_autograd.py
```
FB repo
```
buck test caffe2/test/distributed/...
```
DDP accuracy benchmark workflow run
```
flow-cli canary pytorch.benchmark.accuracy_comparison.workflow --parameters-json '{"node_world_size": 4, "dist_backend": "nccl"}' --run-as-secure-group fblearner_flow --entitlement gpu_prod
```
f196173157
Reviewed By: pritamdamania87
Differential Revision: D21513795
Pulled By: hczhu
fbshipit-source-id: fe21e68ecdc9274182db4d4bb5a1e2d68ef927a2