a branch combining layer-norm-auto-sync and ds_ckpt_reshape #292
Reshape deepspeed checkpoint
67c08f09
Merge remote-tracking branch 'origin/main' into ds_ckpt_reshape
fec1ec5f
add checkpoint tests
675f12ca
Validate input folder
e379065b
Tests for tp/pp reshape
a1068e4d
remove debug folders
115bd313
fix test_checkpoint_reshaping_empty_dir
cc2fad1f
Fix unit tests
b6733d57
Remove deepspeed checkpoint utils
9bf7ac51
Use DS 3D reshaping utils
29ca2bcc
sync layer norms
9ba8210f
all_reduce is an in_place operation
fbd47eed
Make dataloader use another random generator (#276)
a9fb317e
do all_reduce op.AVG directly
8c1ed225
Merge remote-tracking branch 'origin/main' into layer-norm-auto-sync
8937dede
add eval dataloader deadlock workaround
b015ec15
Merge remote-tracking branch 'origin/main' into ds_ckpt_reshape
a3ef7783
convert to bf16
6d863582
wip universal chkpt
804b497d
rename
c29d3369
rename
9c447933
revert generator sync
10f50184
wip on fragments dealing
7e0a81b9
cleanup
d3005120
Loading universal checkpoint with reshaping
ab0a7f8f
all gpu1<->2 reshapes work
d5e33dec
param attrs
85ff56ca
make the tests adaptable to the number of available gpus
f01fa4a5
WIP
f29bacc1
WIP
dd0aeb67
WIP
3bf14fdf
WIP
7ae002d3
Merge remote-tracking branch 'origin/main' into layer-norm-auto-sync
5be13991
Debug functions
55bb5148
args should be required, don't create another latest file
795fedbb
Parallelize shard extraction
cc8810be
close+join pool; add tqdm; comment out noise
04d9ad0f
rename
bca5af4e
parameterize
721380b2
Parallel slice merging
e8a1ccf1
Cleanup
a247614b
Merge remote-tracking branch 'origin/main' into ds_ckpt_reshape
9bb3dc33
Merge remote-tracking branch 'origin/main' into layer-norm-auto-sync
dedf15c6
allow inspection on a machine w/o gpus
d845a1f0
Merge branch 'layer-norm-auto-sync' into ds_ckpt_reshape-with-layer-n…
c44c85b9