DeepSpeed
[ZeRO] Default disable elastic ckpt in stage 1+2 and reduce CPU memory overhead during ckpt load
#1525
Merged


jeffra merged 15 commits into master from zero-ckpt-cpu-issue
jeffra requested a review from awan-10 4 years ago
jeffra requested a review from cli99 4 years ago
jeffra requested a review from conglongli 4 years ago
jeffra requested a review from eltonzheng 4 years ago
jeffra requested a review from minjiaz 4 years ago
jeffra requested a review from niumanar 4 years ago
jeffra requested a review from RezaYazdaniAminabadi 4 years ago
jeffra requested a review from samyam 4 years ago
jeffra requested a review from ShadenSmith 4 years ago
jeffra requested a review from tjruwase 4 years ago
tjruwase commented on 2021-11-05
tjruwase approved these changes on 2021-11-05
jeffra changed the title from "Reduce CPU memory overhead during ZeRO checkpoint loading" to "[ZeRO] Default disable elastic ckpt in stage 1+2 and reduce CPU memory overhead during ckpt load" 4 years ago
jeffra force pushed to 45a416e1 4 years ago
jeffra [squash] zero-ckpt-cpu-issue (#1673)
0fc11fa0
jeffra formatting
dbd08236
jeffra force pushed from 09260b6c to dbd08236 3 years ago
tjruwase Merge branch 'master' into zero-ckpt-cpu-issue
92d87f0c
tjruwase Reduce cpu memory of loading in rigid mode
a6b6770f
tjruwase Merge branch 'master' into zero-ckpt-cpu-issue
21e173be
tjruwase Allocate tensor on param device
cd4ce852
tjruwase Merge branch 'zero-ckpt-cpu-issue' of github.com:microsoft/DeepSpeed …
4b0d366a
tjruwase commented on 2022-01-06
jeffra commented on 2022-01-07
tjruwase Merge branch 'master' into zero-ckpt-cpu-issue
571b0a2c
jeffra add WS check + several unit tests for ckpting (TODO: need to fix a fe…
a4b40fa7
jeffra uncomment exception check in ckpt test
64975092
jeffra Merge branch 'master' into zero-ckpt-cpu-issue
477dc89c
tjruwase Merge branch 'master' into zero-ckpt-cpu-issue
61bdfece
jeffra Merge branch 'master' into zero-ckpt-cpu-issue
c13305f0
jeffra fixes for remaining unit tests
091071de
jeffra Merge branch 'master' into zero-ckpt-cpu-issue
4add9306
jeffra enabled auto-merge (squash) 3 years ago
Auto-merge disabled 3 years ago (manually disabled by user)
jeffra merged 3293cf72 into master 3 years ago
jeffra deleted the zero-ckpt-cpu-issue branch 3 years ago
stas00

