[checkpoint] Synchronize error handling across all ranks (#77091)
Introduce synchronized error handling across all ranks when saving and loading checkpoints.
This makes it much simpler for users to handle failures and, as a positive side effect, to coordinate on when a checkpoint has successfully completed, as sketched below.
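For example, a training script can wrap the save call in a single try/except and every rank observes the same outcome. This is a minimal sketch, assuming the save_state_dict / FileSystemWriter entry points from torch.distributed._shard.checkpoint as they existed around this change; the recovery logic is illustrative, not part of this PR.

```python
import torch.distributed as dist
from torch.distributed._shard.checkpoint import FileSystemWriter, save_state_dict

def save_checkpoint(state_dict, path):
    try:
        # Every rank participates; if any rank fails, all ranks see the error.
        save_state_dict(state_dict=state_dict, storage_writer=FileSystemWriter(path))
    except Exception as exc:
        # All ranks reach this branch together, so recovery stays coordinated.
        print(f"rank {dist.get_rank()}: checkpoint save failed: {exc}")
        raise
```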
This change requires 3 collectives when saving and 1 when loading.
All of these collectives carry only a small payload, so they are latency bound and overall checkpoint time should remain dominated by the actual write.
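The sketch below illustrates how such synchronization can work: each rank captures its local exception instead of raising immediately, exchanges a small error payload via one collective, and then all ranks raise together if any rank failed. The names `_sync_errors`, `CheckpointException`, and `save_with_synced_errors` are assumptions for illustration, not the exact implementation in this PR.

```python
import traceback
from typing import Any, List, Optional

import torch.distributed as dist


class CheckpointException(Exception):
    """Raised on every rank when at least one rank failed (illustrative)."""
    def __init__(self, failures: dict):
        super().__init__(f"Checkpoint failed on ranks: {sorted(failures)}")
        self.failures = failures


def _sync_errors(local_error: Optional[BaseException], group=None) -> None:
    """Exchange a small error payload across ranks and raise everywhere
    if any rank reported a failure. The payload is tiny, so this
    collective is latency bound."""
    payload: List[Any] = [None] * dist.get_world_size(group)
    local = None if local_error is None else traceback.format_exception_only(
        type(local_error), local_error)
    dist.all_gather_object(payload, local, group=group)
    failures = {rank: err for rank, err in enumerate(payload) if err is not None}
    if failures:
        raise CheckpointException(failures)


def save_with_synced_errors(write_fn, group=None) -> None:
    """Run a per-rank write step, then synchronize success/failure."""
    error: Optional[BaseException] = None
    try:
        write_fn()
    except BaseException as exc:  # capture locally, raise collectively below
        error = exc
    _sync_errors(error, group=group)
```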
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77091
Approved by: https://github.com/pritamdamania87, https://github.com/wanchaol