onnxruntime
Add Distributed Checkpointing support
#3639
Merged

ashbhandare merged 15 commits into master from aibhanda/distributed_checkpoint
ashbhandare
ashbhandare requested a review 5 years ago
ashbhandare added training
ashbhandare requested a review from thiagocrepaldi 5 years ago
ashbhandare requested a review from jessebenson 5 years ago
ashbhandare requested a review from SherlockNoMad 5 years ago
ashbhandare force pushed from 29463ea9 to 257086a6 5 years ago
thiagocrepaldi
thiagocrepaldi requested changes on 2020-04-23
thiagocrepaldi
jessebenson
jessebenson commented on 2020-04-24
ashbhandare changed the base branch from ort_training to master 5 years ago
jessebenson
jessebenson dismissed these changes on 2020-04-28
ashbhandare dismissed their stale review 5 years ago
Change naming of moments to Moment_x_<weight_name>
0e0ef3e3
Modify zero test to hit bert like scenario.
22be6146
ashbhandare Revert "Modify zero test to hit bert like scenario."
0e4334f0
ashbhandare Add checkpointing code and zero checkpoint aggregation
ad10dcba
ashbhandare Correct aggregation for LAMB, cleanup
580788e3
ashbhandare Add simple checkpointing test
89f33ff5
ashbhandare Add test for zero checkpoint aggregation
fc05df80
ashbhandare Fix tests
57ddebfe
ashbhandare fix test
3e366524
ashbhandare Review changes
415295f7
ashbhandare Fix test after review comment fix
7f1540fb
ashbhandare Fix API, test
ee759f3f
ashbhandare force pushed to ee759f3f 5 years ago
ashbhandare Fix test after API change
d804b7ed
ashbhandare removed review request from SherlockNoMad 5 years ago
ashbhandare Decouple save load from ORTTrainer
e1d13a27
thiagocrepaldi
thiagocrepaldi requested changes on 2020-04-28
ashbhandare Add flag to not break checkpointing with ORTModel
4e885090
thiagocrepaldi
thiagocrepaldi approved these changes on 2020-04-28
thiagocrepaldi
thiagocrepaldi approved these changes on 2020-04-29
ashbhandare merged 58f53966 into master 5 years ago
ashbhandare deleted the aibhanda/distributed_checkpoint branch 5 years ago
