[checkpoint] Synchronize error handling across all ranks (#77091)
Introduce synchronized error handling across all ranks when saving and loading checkpoints.
This makes it much simpler for users to handle failures and, as a positive side effect, to coordinate on when a checkpoint has successfully completed, as sketched below.
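For example, a training script can wrap the save call in a single try/except and every rank observes the same outcome. This is a minimal sketch, assuming the save_state_dict / FileSystemWriter entry points from torch.distributed._shard.checkpoint as they existed around this change; the recovery logic is illustrative, not part of this PR.

```python
import torch.distributed as dist
from torch.distributed._shard.checkpoint import FileSystemWriter, save_state_dict

def save_checkpoint(state_dict, path):
    try:
        # Every rank participates; if any rank fails, all ranks see the error.
        save_state_dict(state_dict=state_dict, storage_writer=FileSystemWriter(path))
    except Exception as exc:
        # All ranks reach this branch together, so recovery stays coordinated.
        print(f"rank {dist.get_rank()}: checkpoint save failed: {exc}")
        raise
```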
This change requires 3 collectives when saving and 1 when loading.
All of these collectives carry only a small payload, so they are latency bound and overall checkpoint time should remain dominated by the actual write.
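The sketch below illustrates how such synchronization can work: each rank captures its local exception instead of raising immediately, exchanges a small error payload via one collective, and then all ranks raise together if any rank failed. The names `_sync_errors`, `CheckpointException`, and `save_with_synced_errors` are assumptions for illustration, not the exact implementation in this PR.

```python
import traceback
from typing import Any, List, Optional

import torch.distributed as dist


class CheckpointException(Exception):
    """Raised on every rank when at least one rank failed (illustrative)."""
    def __init__(self, failures: dict):
        super().__init__(f"Checkpoint failed on ranks: {sorted(failures)}")
        self.failures = failures


def _sync_errors(local_error: Optional[BaseException], group=None) -> None:
    """Exchange a small error payload across ranks and raise everywhere
    if any rank reported a failure. The payload is tiny, so this
    collective is latency bound."""
    payload: List[Any] = [None] * dist.get_world_size(group)
    local = None if local_error is None else traceback.format_exception_only(
        type(local_error), local_error)
    dist.all_gather_object(payload, local, group=group)
    failures = {rank: err for rank, err in enumerate(payload) if err is not None}
    if failures:
        raise CheckpointException(failures)


def save_with_synced_errors(write_fn, group=None) -> None:
    """Run a per-rank write step, then synchronize success/failure."""
    error: Optional[BaseException] = None
    try:
        write_fn()
    except BaseException as exc:  # capture locally, raise collectively below
        error = exc
    _sync_errors(error, group=group)
```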
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77091
Approved by: https://github.com/pritamdamania87, https://github.com/wanchaol