pytorch
1d2382f1 - [DDP] Use compiled_autograd to trace DDP backward allreduce (#110662)

Commit

228 days ago

[DDP] Use compiled_autograd to trace DDP backward allreduce (#110662)

**Summary**

The reducer of `DistributedDataParallel` is implemented in C++, so it is not easy to trace the allreduce launched in the reducer. This PR modifies `DistributedDataParallel` to launch one allreduce per gradient when `compiled_autograd` is enabled. This change lets `compiled_autograd` trace the allreduces so that they can later be optimized (fused) by Inductor.

**Key Logic**

1. If `ddp_python_hook` is True, we assume `compiled_autograd` is used. `DistributedDataParallel` registers `compiled_accum_grad_hook` for all parameters.
2. In the first forward() call, if `DistributedDataParallel` is not compiled, all `compiled_accum_grad_hook` hooks are deregistered. If `DistributedDataParallel` is compiled, all `compiled_accum_grad_hook` hooks will be compiled by `compiled_autograd`.
3. `compiled_accum_grad_hook` launches an allreduce to reduce the gradient of the parameter.

**Bucketing**

The compiled backward is slow because there is no bucketing for the allreduces. We rely on Inductor to bucket the allreduces. The bucketing is done in a separate PR.

Differential Revision: [D49428482](https://our.internmc.facebook.com/intern/diff/D49428482/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110662

Approved by: https://github.com/wconstab
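
The hook lifecycle described under **Key Logic** can be illustrated with a small, framework-free simulation. This is only a sketch and not the code from this PR: `ToyDDP`, its `python_hook` and `compiled` arguments, and the in-process "allreduce" (a plain mean over per-rank values) are all hypothetical stand-ins chosen for illustration. It assumes that the hooks are simply dropped on the first forward when the module is not compiled, and it does not model the C++ reducer that would handle gradients in that case.

```python
from typing import Callable, Dict, List


class ToyDDP:
    """Toy model of the per-gradient hook lifecycle (hypothetical, not PyTorch API)."""

    def __init__(self, param_names: List[str], world_size: int, python_hook: bool):
        # grads[name][rank] is the local gradient of parameter `name` on `rank`.
        self.grads: Dict[str, List[float]] = {
            name: [0.0] * world_size for name in param_names
        }
        self.hooks: Dict[str, Callable[[], None]] = {}
        self.allreduce_log: List[str] = []  # one entry per simulated allreduce
        self._first_forward_done = False
        if python_hook:
            # Step 1: register one gradient hook per parameter.
            for name in param_names:
                self.hooks[name] = self._make_hook(name)

    def _make_hook(self, name: str) -> Callable[[], None]:
        def hook() -> None:
            # Step 3: launch one "allreduce" for this gradient.
            # Simulated here as averaging the per-rank values in place.
            values = self.grads[name]
            mean = sum(values) / len(values)
            self.grads[name] = [mean] * len(values)
            self.allreduce_log.append(name)

        return hook

    def forward(self, compiled: bool) -> None:
        # Step 2: on the first forward, drop the hooks if the module is not compiled.
        if not self._first_forward_done:
            self._first_forward_done = True
            if not compiled:
                self.hooks.clear()

    def backward(self, local_grads: Dict[str, List[float]]) -> None:
        # Accumulate each gradient, then fire its hook if one is still registered.
        for name, per_rank in local_grads.items():
            self.grads[name] = list(per_rank)
            hook = self.hooks.get(name)
            if hook is not None:
                hook()


if __name__ == "__main__":
    ddp = ToyDDP(["weight", "bias"], world_size=2, python_hook=True)
    ddp.forward(compiled=True)
    ddp.backward({"weight": [1.0, 3.0], "bias": [0.0, 4.0]})
    print(ddp.allreduce_log)  # one simulated allreduce per gradient
    print(ddp.grads)
```

In this toy model the number of simulated allreduces equals the number of parameters, which mirrors why the commit message describes the unbucketed compiled backward as slow and defers bucketing to Inductor.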

Author

fegin

Committer

pytorchmergebot

Parents

113506d2
