pytorch
388ba7e5 - [ptd] make multithreaded pg wait for readiness before the 1st collective (#106954)

1 year ago

[ptd] make multithreaded pg wait for readiness before the 1st collective (#106954)

Summary:
This used to be not a problem because in c10d collective init, a store based barrier would be applied. This recently got changed in https://github.com/pytorch/pytorch/pull/103033 where the barrier is not by default applied.

For normal PGs like gloo/nccl, this is not a problem as the rendezvous process is implicitly a barrier anyway. But for threaded pg, without the store based barrier this would lead to race condition as the local pg does not wait for world to be ready before starting collectives.

This fixes the issue by just doing a store based barrier for each pg created. The CV attempt wouldn't work since that would still rely on class level variables which would break in the device mesh case. See inline comment for details.

Differential Revision: D48220125

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106954
Approved by: https://github.com/wanchaol, https://github.com/H-Huang, https://github.com/XilunWu
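
To illustrate the idea of a store based barrier between threads, here is a minimal sketch in plain Python. It is not the code from this commit and does not use PyTorch: `InMemoryStore`, `store_based_barrier`, `run_world`, and the key name `pg_0_barrier` are all hypothetical names chosen for illustration. Each thread plays one rank, checks in by incrementing a shared counter, and polls until the whole world has checked in before starting its first "collective".

```python
import threading
import time


class InMemoryStore:
    """Toy key/value store shared by all threads (hypothetical, not PyTorch's API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def add(self, key, amount):
        # Atomically increment a counter and return the new value.
        with self._lock:
            self._data[key] = self._data.get(key, 0) + amount
            return self._data[key]

    def get(self, key):
        with self._lock:
            return self._data.get(key, 0)


def store_based_barrier(store, key, world_size, timeout=5.0, poll=0.001):
    """Block until world_size ranks have checked in under key."""
    store.add(key, 1)  # announce that this rank is ready
    deadline = time.monotonic() + timeout
    while store.get(key) < world_size:
        if time.monotonic() > deadline:
            raise TimeoutError(f"barrier {key!r} timed out")
        time.sleep(poll)


def run_world(world_size):
    """Run one thread per rank; return how many ranks each saw as ready."""
    store = InMemoryStore()
    ready = []                # ranks that have finished their local setup
    seen_at_collective = {}   # rank -> number of ready ranks it observed

    def worker(rank):
        time.sleep(0.01 * rank)  # stagger setup so the race would be visible
        ready.append(rank)
        # Assumed per-group key; a separate key per group avoids shared class state.
        store_based_barrier(store, "pg_0_barrier", world_size)
        # First "collective": by now every rank must have finished setup.
        seen_at_collective[rank] = len(ready)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen_at_collective


if __name__ == "__main__":
    print(run_world(4))
```

Without the call to `store_based_barrier`, rank 0 in this sketch would reach its first "collective" while later ranks are still setting up, which is the kind of race the commit message describes for the threaded pg.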

Author

xunnanxu

Committer

pytorchmergebot
