[Draft] Add ZeRO-3 elastic checkpoint save/load support #8031
Initial plan
001f77c3
Revert "fix: update 1 file reformatted."
b90aee5a
Merge pull request #5 from nathon-lee/copilot/git-revert-ff886701
b6da9afd
Merge branch 'deepspeedai:master' into master
bb7f64fd
Initial plan
cbc816c9
Reapply "fix: update 1 file reformatted."
5fcc9a7e
Merge pull request #6 from nathon-lee/copilot/remove-commits-from-master
f7c5d75d
Merge branch 'deepspeedai:master' into master
18efbcc3
Merge branch 'deepspeedai:master' into master
e2ac74d2
Merge branch 'deepspeedai:master' into master
da07382d
Merge branch 'deepspeedai:master' into master
5d8875cc
Merge branch 'deepspeedai:master' into master
316b6dda
Merge branch 'deepspeedai:master' into master
2020543f
Merge branch 'deepspeedai:master' into master
1a8694c6
Merge branch 'deepspeedai:master' into master
d6725be0
Merge branch 'deepspeedai:master' into master
a06c5487
Merge branch 'deepspeedai:master' into master
6959eb4b
Merge branch 'deepspeedai:master' into master
e88eb3e3
feat(zero): implement elastic checkpoint support for ZeRO-3
196f60c2
[ZeRO-3]: Implement elastic checkpoint save and load
773f32ea
[ZeRO-3]: Add unit tests for elastic checkpoint
9af52c28
fix: fix stage-3 elastic checkpoint cross-world-size save/load
3b084846
fix(zero3): move restored elastic optimizer state to correct device
b502ee11
Merge pull request #17 from nathon-lee/feat/zero3-elastic-checkpoint-…
c26b8379
docs(zero): document ZeRO-3 elastic checkpoint support
526d5c27
Merge pull request #18 from nathon-lee/feat/zero3-elastic-checkpoint-…
7eb7e033
Merge branch 'deepspeedai:master' into feat/zero3-elastic-checkpoint
a7d728c2
Merge branch 'deepspeedai:master' into feat/zero3-elastic-checkpoint
0d95e79b