[C10d][NCCL] Refactor complex all_reduce and broadcast (#121045)
This PR is needed because the autograd engine + DDP call `all_reduce` directly from C++, bypassing the Python wrapper that handles complex tensors, so the changes must be made in C++. Without them, DDP training with complex parameters fails in the backward pass:
```
[rank0]: Traceback (most recent call last):
[rank0]: File "~/complex_ddp.py", line 72, in <module>
[rank0]: main()
[rank0]: File "~/complex_ddp.py", line 64, in main
[rank0]: loss.backward()
[rank0]: File "/home/usr/pytorch/torch/_tensor.py", line 525, in backward
[rank0]: torch.autograd.backward(
[rank0]: File "/home/usr/pytorch/torch/autograd/__init__.py", line 267, in backward
[rank0]: _engine_run_backward(
[rank0]: File "/home/usr/pytorch/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]: TypeError: Input tensor data type is not supported for NCCL process group: ComplexFloat
```
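For reference, a minimal sketch of the kind of script that hits this (illustrative only; not the actual `complex_ddp.py` from the traceback):
```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Assumes launch via torchrun, which sets the env:// rendezvous variables.
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Complex parameters make the gradient buckets that DDP all-reduces
    # during backward ComplexFloat, which NCCL rejected before this PR.
    model = nn.Linear(8, 8, dtype=torch.complex64, device="cuda")
    ddp = DDP(model, device_ids=[rank])

    x = torch.randn(4, 8, dtype=torch.complex64, device="cuda")
    loss = ddp(x).abs().sum()  # real-valued loss over complex outputs
    loss.backward()            # C++ engine -> DDP reducer -> NCCL all_reduce

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```
Launched with e.g. `torchrun --nproc_per_node=2 complex_ddp.py`.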
I believe the same could be done for the rest of the ops to minimize Python overhead. What do you think, @kwen2501?
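For context, the Python-level complex handling being moved down into C++ looks roughly like the sketch below (`all_reduce_complex` is an illustrative name, not the actual wrapper):
```python
import torch
import torch.distributed as dist

def all_reduce_complex(tensor, op=dist.ReduceOp.SUM, group=None):
    # NCCL has no complex dtypes. view_as_real maps complex64 -> float32
    # with a trailing (real, imag) dimension and shares storage, so the
    # in-place all_reduce updates the original complex tensor. This is
    # only valid for elementwise ops like SUM/AVG; ops such as PRODUCT
    # or MAX have no meaningful interpretation on the real view.
    if tensor.is_complex():
        tensor = torch.view_as_real(tensor)
    dist.all_reduce(tensor, op=op, group=group)
```
Doing the equivalent inside the C++ process group means both the Python wrappers and direct C++ callers (like the DDP reducer) get the same behavior.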
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121045
Approved by: https://github.com/eqy, https://github.com/kwen2501