[dist.checkpoint] Change metadata format and improve error reporting (#82078)
This PR implements the following changes.
Move to a new checkpoint metadata format that separates logical data from storage data.
This is a step in the direction of supporting extensible checkpointing as it moves us away from the hardcoded storage model enforced by the FileSystem storage layer.
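To illustrate the idea of splitting logical from storage data, here is a minimal sketch. It is not the metadata format introduced by this PR: the names `TensorInfo`, `StorageLocation`, and `CheckpointMetadata`, and all of their fields, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class TensorInfo:
    """Hypothetical logical record: what the tensor is, independent of any backend."""
    shape: Tuple[int, ...]
    dtype: str


@dataclass(frozen=True)
class StorageLocation:
    """Hypothetical storage record: where the bytes live for one particular backend."""
    relative_path: str
    offset: int
    length: int


@dataclass
class CheckpointMetadata:
    """Hypothetical container that keeps the two kinds of data in separate maps."""
    logical: Dict[str, TensorInfo] = field(default_factory=dict)
    storage: Dict[str, StorageLocation] = field(default_factory=dict)


meta = CheckpointMetadata()
# Logical side: only the tensor's shape and dtype are recorded.
meta.logical["linear.weight"] = TensorInfo(shape=(4, 8), dtype="float32")
# Storage side: a file-based layout; another backend could store something else here.
meta.storage["linear.weight"] = StorageLocation(
    relative_path="shard_0.bin", offset=0, length=4 * 8 * 4
)
```

Under this assumed layout, a different storage backend would replace only the `storage` entries and leave the `logical` entries untouched.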
Change CheckpointException to include the exception traceback. Traceback objects are not serializable, so they need special handling; otherwise users get error messages with no indication of where the failure occurred.
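One common way to handle this, shown here as a sketch and not necessarily the approach taken in this PR, is to format the traceback into a string before the failure is serialized. The helper `failure_to_payload` and the function `risky_write` are hypothetical names used only for this example.

```python
import pickle
import traceback
from typing import Dict


def failure_to_payload(exc: BaseException) -> Dict[str, str]:
    """Hypothetical helper: flatten an exception into plain, picklable strings."""
    return {
        "type": type(exc).__name__,
        "message": str(exc),
        # A traceback object cannot be pickled, but its formatted text can.
        "traceback": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }


def risky_write() -> None:
    """Stand-in for a storage operation that fails."""
    raise ValueError("disk full")


try:
    risky_write()
except ValueError as e:
    payload = failure_to_payload(e)

# The payload survives a pickle round trip, as it would when sent to another process.
restored = pickle.loads(pickle.dumps(payload))
print(restored["traceback"])
```

The receiving side can then attach the formatted text to whatever exception it raises, so the user still sees where the original failure happened.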
Finally, remove `validate_state_dict` as it has lost its usefulness. Loading is becoming flexible enough that the only reasonable way to verify whether a given configuration can be loaded is to actually try it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82078
Approved by: https://github.com/wanchaol, https://github.com/fduwjj