Fix NCCL/Gloo process groups and DDP stream sync bug (#18465)
Summary:
DDP with the NCCL backend uses a [worker stream](https://github.com/pytorch/pytorch/blob/d3eb941ed96774efb8d89a0b20c9e49807ea85a7/torch/csrc/distributed/c10d/ddp.cpp#L142) to flatten grad batch
tensors, and passes the flattened tensor to [another stream](https://github.com/pytorch/pytorch/blob/d3eb941ed96774efb8d89a0b20c9e49807ea85a7/torch/lib/c10d/ProcessGroupNCCL.cpp#L379) to
run ncclAllReduce. The flattened tensor has to record the
ncclAllReduce stream; otherwise the caching allocator treats its memory
as free once the worker stream's work completes and may reuse it while
the allreduce is still in flight, so multiple streams end up accessing
the same memory space.
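
A minimal sketch of the record-stream pattern described above, assuming the PyTorch C++ CUDA APIs; the helper name and stream setup are illustrative, not the PR's actual code:

```cpp
#include <ATen/ATen.h>
#include <ATen/cuda/CUDAEvent.h>
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/cuda/CUDAStream.h>

// Hypothetical helper: consume a tensor produced on the current (worker)
// stream from a separate side stream, recording the side stream so the
// caching allocator does not recycle the tensor's memory too early.
void allReduceOnSideStream(const at::Tensor& flattened) {
  // Side stream that will run the collective (stands in for the dedicated
  // ncclAllReduce stream inside ProcessGroupNCCL).
  c10::cuda::CUDAStream ncclStream = c10::cuda::getStreamFromPool(
      /*isHighPriority=*/false,
      static_cast<c10::DeviceIndex>(flattened.get_device()));

  // Order the side stream after the worker stream that produced `flattened`.
  at::cuda::CUDAEvent producedEvent;
  producedEvent.record(c10::cuda::getCurrentCUDAStream());
  producedEvent.block(ncclStream);

  // Tell the caching allocator that `flattened` is in use on ncclStream.
  // Without this call, the allocator considers the memory free as soon as
  // the worker stream's work finishes and may hand it to another allocation
  // while the collective is still reading it.
  c10::cuda::CUDACachingAllocator::recordStream(
      flattened.storage().data_ptr(), ncclStream);

  // ... launch ncclAllReduce on ncclStream here ...
}
```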
cc ppwwyyxx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18465
Differential Revision: D14613449
Pulled By: mrshenli
fbshipit-source-id: b62773732552d12cc87b7adeb6897e9e11753ea9