ZeRO-2 #217

jeffra merged 57 commits into master from zero2-staging
jeffra
jeffra Squashed dev zero (#14)
e67a3def
samyam Support for new apex style optimizer.step(), grad_clip bug fix in Zer…
110b7ad1
samyam Formatting fix
4e33696c
jeffra Fix several unit tests, some still broken
84f5ba6a
samyam Adding activation checkpointing as deepspeed file
ed8841f9
samyam adding hash to deepspeed_checkpointing
346c5c28
samyam formatting
30c4f6c5
jeffra zero stage 2 does not support grad accu, also fix formatting
22572064
jeffra fix checkpoint tests, remove catch all try/except, increase pytest ti…
1ce64def
jeffra fix test_zero_static_scale unit test
eb6f3c65
jeffra disable empty partition test for now
6b3b2a8b
jeffra remove conversion of loss to float before scaling (tests failing)
e21e1350
jeffra update squad model tests to run zero2 and use bsz=6
33224c77
jeffra add squad zero2 config
4dca61e8
samyam Adding support for deepspeed_checkpointing through deepspeed.checkpoi…
0ca3bd5b
samyam Removed redundant model tests. Added testing deepspeed activation che…
720fab7b
jeffra return float conversion in backward, convert loss to float before gra…
c0cb47ea
samyam Refactored deepspeed.checkpointing API to pass ds_config directly to …
4f433ee0
samyam fixing test paths
aebd1f29
tjruwase Optional loading of optimizer and learning rate scheduler states in
dace62ce
tjruwase Fix formatting issues
16e82024
tjruwase Strict option for checkpoint loading
54f47d42
tjruwase Enable loading checkpoints without optimizer state with different DP …
ac8a526d
tjruwase Fix bug
9d0b194a
samyam Updating Megatron Tutorial
21cb8217
samyam replacing perf section in Megatron Tutorial
72210701
samyam megatron tutorial updates and activation checkpointing json configs
8a555b20
samyam Documentation : Added code comments to deepspeed.checkpointing \ Adde…
591bdc63
samyam Updating Megatron Tutorial
0ce25619
samyam replacing perf section in Megatron Tutorial
5865b9a5
ShadenSmith getting docs to build
624eca51
ShadenSmith getting docs to build
4e391da5
samyam formatting
2ee9706d
samyam Addressing Jeff and Shaden's feedback on documentation
3648a87b
eltonzheng add compute & communication overlapping
d0940a03
eltonzheng fix the format error
0b44ab68
jeffra update DSE submodule to point to DS 0.2 version
9131be09
eltonzheng update according to code review feedback
1c17fa77
eltonzheng check contigious_gradients before using previous_reduced_grads
e85d3c3d
samyam Adding more documentations, in features.md and index.md (#33)
b12eb2e6
jeffra update nav bar docs
399cc2da
jeffra change ordering of tutorials
8d8290ae
jeffra reorder nav
b68929e7
ShadenSmith contigious -> contiguous
8f7322cd
fixing my merge handiwork (#43)
fd111c0f
updating code docs (#42)
4f9e6502
Web/doc edits (#45)
e1504d81
arashashari Adding back previous zero optimization (bool) (#44)
16f96b2d
jeffra update DSE commit
219e6220
jeffra Merge branch 'master' into zero2-staging
28f01e0e
samyam few sentences on low bandwidth clusters in Megatron Tutorial (#46)
2a915cd4
ShadenSmith Merge branch 'dev-zero-may1' into zero2-staging
8c4f9405
ShadenSmith news updates
1ab5961c
jeffra bump to v0.2.0, ignore *.log, use cache_dir in megatron tests
914aa356
ShadenSmith blurb for news items
c352f164
jeffra update
c7cbcdac
ShadenSmith blurb updates
02eb292b
jeffra merged f2ac7eaf into master 5 years ago
jeffra deleted the zero2-staging branch 5 years ago
