onnxruntime
58f53966 - Add Distributed Checkpointing support (#3639)

Commit
5 years ago
Add Distributed Checkpointing support (#3639)

* Change naming of moments to Moment_x_<weight_name>
* Add checkpointing code and zero checkpoint aggregation
* Correct aggregation for LAMB, cleanup
* Add simple checkpointing test
* Add test for zero checkpoint aggregation
* Fix tests
* fix test
* Review changes
* Fix test after review comment fix
* Fix API, test
* Fix test after API change
* Decouple save load from ORTTrainer
* Add flag to not break checkpointing with ORTModel

Co-authored-by: aishwarya bhandare <aibhanda@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>