DeepSpeed
Checkpoint reshaping #1953
Merged

tjruwase merged 71 commits into master from olruwase/elastic-ckpt-refresh
jeffra unit test, remove exception, add notes
24fe7002
tjruwase Merge branch 'master' of github.com:microsoft/DeepSpeed into elastic-…
70a68d08
tjruwase Move param_shapes to model files
aafa4e57
tjruwase Remove hard-coded constants
162c19b3
tjruwase Merge branch 'olruwase/relocate_param_shapes' of github.com:microsoft…
84c5d170
tjruwase Merge branch 'master' into olruwase/relocate_param_shapes
59e86dd0
tjruwase Conditioned to zero optimizer
680e6207
tjruwase Merge branch 'olruwase/relocate_param_shapes' of github.com:microsoft…
8bf3c4e1
tjruwase Add zero checkpoint merging
f1b5d16b
tjruwase Merge branch 'olruwase/relocate_param_shapes' of github.com:microsoft…
58d34953
jeffra Merge branch 'master' into olruwase/relocate_param_shapes
145638d8
tjruwase Print checkpoint version
fd8c3e68
tjruwase Merge branch 'olruwase/relocate_param_shapes' of github.com:microsoft…
d85a6df0
tjruwase Merge with relocate_param_shapes
c642600c
tjruwase Reshape zero_* ckpt files
c8689fd2
tjruwase Merge zero* files contraction
4a86c1a5
tjruwase Utils for 3D contraction reshaping
f5db8df8
tjruwase Rebase
d5c68438
tjruwase requested a review from jeffra 3 years ago
tjruwase requested a review from samyam 3 years ago
tjruwase requested a review from ShadenSmith 3 years ago
tjruwase requested a review from conglongli 3 years ago
tjruwase requested a review from awan-10 3 years ago
tjruwase requested a review from cli99 3 years ago
tjruwase requested a review from eltonzheng 3 years ago
tjruwase requested a review from minjiaz 3 years ago
tjruwase requested a review from RezaYazdaniAminabadi 3 years ago
tjruwase Merge branch 'master' into olruwase/elastic-ckpt-refresh
e6179201
tjruwase Merge branch 'master' into olruwase/elastic-ckpt-refresh
ef8a4a73
tjruwase Remove bogus import
c12a4e7f
tjruwase Merge branch 'olruwase/elastic-ckpt-refresh' of github.com:microsoft/…
0b2c33bf
tjruwase Merge branch 'master' into olruwase/elastic-ckpt-refresh
86efe30f
tjruwase Support bf16_zero ckpts
1031b324
tjruwase Merge branch 'olruwase/elastic-ckpt-refresh' of github.com:microsoft/…
6f294658
tjruwase Merge branch 'master' of github.com:microsoft/DeepSpeed into olruwase…
8f23728c
tjruwase Add param slice mappings
fd1a377f
tjruwase Merge branch 'master' into olruwase/elastic-ckpt-refresh
3d4a27b5
tjruwase Load universal checkpoints
10083db7
tjruwase Merge branch 'olruwase/elastic-ckpt-refresh' of github.com:microsoft/…
567454a5
tjruwase Per group mappings from Stas
22c75505
tjruwase Hack to load bf16 zero files
5df4135c
tjruwase Param attributes
ae2825fd
tjruwase WIP
d11a8dc2
tjruwase Merge branch 'master' into olruwase/elastic-ckpt-refresh
7948c45a
tjruwase Fix api bug
691b29d1
tjruwase Merge branch 'olruwase/elastic-ckpt-refresh' of github.com:microsoft/…
a05f9532
tjruwase Update lp with local/remote hp
c0a42d36
tjruwase Disable vocab padding handling
b4ca4556
tjruwase Update z2 checkpoint
b8b54c83
tjruwase Remove debug prints
be86df9b
tjruwase Remove debug prints; Rebase unit test
c87543b7
tjruwase Add reshape assert
c18ff2d0
tjruwase Padding
4ea36b74
tjruwase Typo
03715817
tjruwase Catch nonexistent checkpoint path
a74abc1e
tjruwase Merge branch 'master' into olruwase/elastic-ckpt-refresh
2b707f2d
tjruwase Cleanup
529dbaeb
tjruwase Merge branch 'olruwase/elastic-ckpt-refresh' of github.com:microsoft/…
e126d2e4
deepspeedai deleted a comment from rocm-mici on 2022-06-09
jeffra commented on 2022-06-09
tjruwase Restore checkpoint state comparisons
9e2766fa
mrwyattii approved these changes on 2022-06-10
jeffra approved these changes on 2022-06-10
tjruwase Merge branch 'master' into olruwase/elastic-ckpt-refresh
5c90ef1e
tjruwase Merge branch 'master' into olruwase/elastic-ckpt-refresh
726982ba
jeffra Merge branch 'master' into olruwase/elastic-ckpt-refresh
add1d0c9
jeffra requested a review from duli2012 3 years ago
jeffra requested a review from yaozhewei 3 years ago
jeffra requested a review from arashb 3 years ago
jeffra requested a review from xiaoxiawu-microsoft 3 years ago
tjruwase Merge branch 'master' into olruwase/elastic-ckpt-refresh
5fca3db8
Muennighoff commented on 2022-06-20
tjruwase Merge branch 'master' into olruwase/elastic-ckpt-refresh
901b1e63
tjruwase Merge branch 'master' into olruwase/elastic-ckpt-refresh
30896ded
tjruwase Merge branch 'master' into olruwase/elastic-ckpt-refresh
93934f6a
tjruwase Merge branch 'master' into olruwase/elastic-ckpt-refresh
ecb3dc8a
Muennighoff commented on 2022-06-23
Muennighoff commented on 2022-06-23
Muennighoff commented on 2022-06-29
tjruwase Merge branch 'master' into olruwase/elastic-ckpt-refresh
6c7d947e
stas00 commented on 2022-07-04
stas00 commented on 2022-07-04
tjruwase Merge branch 'master' into olruwase/elastic-ckpt-refresh
cd8dea73
tjruwase Merge branch 'master' into olruwase/elastic-ckpt-refresh
4217be24
tjruwase requested a review from samadejacobs 3 years ago
tjruwase Merge branch 'master' into olruwase/elastic-ckpt-refresh
206e630f
tjruwase Add torch version guards
14980ad4
tjruwase Merge branch 'master' into olruwase/elastic-ckpt-refresh
f3145818
tjruwase More precise avoidance of false positives.
868c463a
tjruwase Merge branch 'olruwase/elastic-ckpt-refresh' of github.com:microsoft/…
e22487af
tjruwase Merge branch 'master' into olruwase/elastic-ckpt-refresh
e0da15f9
tjruwase Merge branch 'master' into olruwase/elastic-ckpt-refresh
623430e0
tjruwase Merge branch 'master' into olruwase/elastic-ckpt-refresh
2556578b
tjruwase Merge branch 'master' into olruwase/elastic-ckpt-refresh
bf57d814
tjruwase Merge branch 'master' into olruwase/elastic-ckpt-refresh
e4a5a464
tjruwase merged 80d0a32f into master 3 years ago
