pytorch
ad21890f - [c10d] Scalable PG initiation. (#99931)

[c10d] Scalable PG initiation. (#99931)

Add a `use_local_synchronization` argument to `new_group`. When this argument is True, `new_group` performs a store barrier only on the ranks that are part of the group, not on the whole cluster. This addresses both the scalability and the composability problems associated with `new_group`.

Fixes #81291. This is relanding #84224.

As part of the original PR I did a quick benchmark of creating 3 PGs per rank using both modes; perf is the following:

`new_group` with `use_local_synchronization=False`:

| World Size | Time (in secs) |
| --- | --- |
| 4 | 0.12 |
| 8 | 0.25 |
| 16 | 0.51 |
| 32 | 0.87 |
| 64 | 1.50 |
| 128 | 2.87 |

`new_group` with `use_local_synchronization=True`:

| World Size | Time (in secs) |
| --- | --- |
| 4 | 0.05 |
| 8 | 0.04 |
| 16 | 0.03 |
| 32 | 0.03 |
| 64 | 0.04 |
| 128 | 0.04 |

Scaling for `use_local_synchronization=False` is sub-linear because the number of process groups created as a multiple of world_size decreases as we go up: it's 6 with world_size 4 and 192 with world_size 128. Scaling for `use_local_synchronization=True` is constant because the number of store barriers executed per rank remains constant at 3.

Setup: 1 AWS host, backend gloo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99931
Approved by: https://github.com/xw285cornell
Author: Rodrigo Kumpera