DeepSpeed
[Draft] Add ZeRO-3 elastic checkpoint save/load support
#8031
Open


nathon-lee
Copilot Initial plan
001f77c3
Copilot Revert "fix: update 1 file reformatted."
b90aee5a
nathon-lee Merge pull request #5 from nathon-lee/copilot/git-revert-ff886701
b6da9afd
nathon-lee Merge branch 'deepspeedai:master' into master
bb7f64fd
Copilot Initial plan
cbc816c9
Copilot Reapply "fix: update 1 file reformatted."
5fcc9a7e
nathon-lee Merge pull request #6 from nathon-lee/copilot/remove-commits-from-master
f7c5d75d
nathon-lee Merge branch 'deepspeedai:master' into master
18efbcc3
nathon-lee Merge branch 'deepspeedai:master' into master
e2ac74d2
nathon-lee Merge branch 'deepspeedai:master' into master
da07382d
nathon-lee Merge branch 'deepspeedai:master' into master
5d8875cc
nathon-lee Merge branch 'deepspeedai:master' into master
316b6dda
nathon-lee Merge branch 'deepspeedai:master' into master
2020543f
nathon-lee Merge branch 'deepspeedai:master' into master
1a8694c6
nathon-lee Merge branch 'deepspeedai:master' into master
d6725be0
nathon-lee Merge branch 'deepspeedai:master' into master
a06c5487
nathon-lee Merge branch 'deepspeedai:master' into master
6959eb4b
nathon-lee Merge branch 'deepspeedai:master' into master
e88eb3e3
nathon-lee feat(zero): implement elastic checkpoint support for ZeRO-3
196f60c2
nathon-lee [ZeRO-3]: Implement elastic checkpoint save and load
773f32ea
nathon-lee [ZeRO-3]: Add unit tests for elastic checkpoint
9af52c28
nathon-lee fix: fix stage-3 elastic checkpoint cross-world-size save/load
3b084846
nathon-lee fix(zero3): move restored elastic optimizer state to correct device
b502ee11
nathon-lee Merge pull request #17 from nathon-lee/feat/zero3-elastic-checkpoint-…
c26b8379
nathon-lee docs(zero): document ZeRO-3 elastic checkpoint support
526d5c27
nathon-lee Merge pull request #18 from nathon-lee/feat/zero3-elastic-checkpoint-…
7eb7e033
nathon-lee Merge branch 'deepspeedai:master' into feat/zero3-elastic-checkpoint
a7d728c2
nathon-lee Merge branch 'deepspeedai:master' into feat/zero3-elastic-checkpoint
0d95e79b

Reviewers
No reviews
Assignees
No one assigned