pytorch
c9570e4b - [checkpoint] Synchronize error handling across all ranks (#77091)

Introduce error handling that is synchronized across all ranks when loading and saving checkpoints. This makes it much simpler for users to handle failures and, as a positive side effect, lets all ranks agree on when a checkpoint has successfully finished.

This change requires 3 collectives when saving and 1 when loading. Each of these collectives carries a small payload, so they are latency bound and write time should dominate the overall cost.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77091
Approved by: https://github.com/pritamdamania87, https://github.com/wanchaol
Author
Rodrigo Kumpera
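Below is a minimal sketch of the coordination idea described in the commit message: each rank captures its local failure instead of raising immediately, the error statuses are exchanged with one small collective, and every rank then raises (or proceeds) together. It assumes an already-initialized process group; the function and exception names are illustrative, not the actual `torch.distributed.checkpoint` API.

```python
import torch.distributed as dist


class CheckpointError(Exception):
    """Illustrative exception raised on every rank when any rank fails."""


def save_with_synchronized_errors(write_local_shard):
    """Run a rank-local checkpoint step, then synchronize error state across ranks.

    Requires dist.init_process_group(...) to have been called.
    """
    local_error = None
    try:
        # Rank-local work, e.g. serializing this rank's shards to storage.
        write_local_shard()
    except Exception as exc:
        # Capture instead of raising so the collective below still runs on this rank.
        local_error = exc

    # One small, latency-bound collective: gather every rank's error status.
    all_errors = [None] * dist.get_world_size()
    dist.all_gather_object(all_errors, local_error)

    # Every rank sees the same global view and fails (or succeeds) together.
    failed = {rank: err for rank, err in enumerate(all_errors) if err is not None}
    if failed:
        raise CheckpointError(f"checkpoint failed on ranks {sorted(failed)}: {failed}")
```

Because every rank participates in the same collective and inspects the same gathered result, a failure on one rank surfaces everywhere instead of leaving the other ranks blocked or silently writing a partial checkpoint, which is the user-facing simplification the commit describes.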