pytorch
19a0eb4c - [c10d] Monitored barrier: option to collect all failed ranks (#55010)

Commit
3 years ago
[c10d] Monitored barrier: option to collect all failed ranks (#55010)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55010

Follow up change to add a flag to provide an option for monitored barrier to collect all the failed ranks and then throw instead of just throwing on the first one. This is useful as now monitored barrier will be able to pick up on all hanging ranks instead of just one.

This is done by passing in a flag `wait_all_ranks=True`.

ghstack-source-id: 125699839

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27447787

fbshipit-source-id: ec23aee212060d9eb515ff8adc96c6a17822d1bb
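
The behavioural difference described in the summary can be illustrated with a toy model. This is a minimal pure-Python sketch, not the c10d implementation: the names `simulate_monitored_barrier`, `report`, and `responsive_ranks` are hypothetical, the error strings are made up for illustration, and the real barrier detects hanging ranks through timeouts rather than a precomputed set. Only the flag name `wait_all_ranks` comes from the commit message.

```python
from typing import Iterable, List, Optional, Set


def simulate_monitored_barrier(
    world_size: int,
    responsive_ranks: Iterable[int],
    wait_all_ranks: bool = False,
) -> None:
    """Toy model: rank 0 checks in with every other rank in order.

    responsive_ranks is a hypothetical stand-in for "ranks that reached
    the barrier before the timeout expired".
    """
    responsive: Set[int] = set(responsive_ranks)
    failed: List[int] = []
    for rank in range(1, world_size):
        if rank in responsive:
            continue
        if not wait_all_ranks:
            # Original behaviour: throw on the first rank that failed.
            raise RuntimeError(f"Rank {rank} failed to pass monitored barrier")
        # With wait_all_ranks=True: keep going and remember every failure.
        failed.append(rank)
    if failed:
        raise RuntimeError(f"Ranks {failed} failed to pass monitored barrier")


def report(
    world_size: int,
    responsive_ranks: Iterable[int],
    wait_all_ranks: bool = False,
) -> Optional[str]:
    """Return the error message, or None if every rank passed."""
    try:
        simulate_monitored_barrier(world_size, responsive_ranks, wait_all_ranks)
    except RuntimeError as err:
        return str(err)
    return None


if __name__ == "__main__":
    # Ranks 1 and 3 hang; only rank 2 responds.
    print(report(4, [2]))                       # reports only rank 1
    print(report(4, [2], wait_all_ranks=True))  # reports ranks 1 and 3
```

In this model, the default mode stops at the first hanging rank, so a second hanging rank would only surface on a later run. Collecting all failures costs a full pass over the ranks but names every hanging rank in one error.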