Megatron-DeepSpeed
a branch combining layer-norm-auto-sync and ds_ckpt_reshape
#292
Open

Commits
  • Reshape deepspeed checkpoint
    tjruwase committed 4 years ago
  • Merge remote-tracking branch 'origin/main' into ds_ckpt_reshape
    stas00 committed 4 years ago
  • add checkpoint tests
    stas00 committed 4 years ago
  • Validate input folder
    tjruwase committed 4 years ago
  • Tests for tp/pp reshape
    tjruwase committed 4 years ago
  • remove debug folders
    stas00 committed 4 years ago
  • fix test_checkpoint_reshaping_empty_dir
    stas00 committed 4 years ago
  • Fix unit tests
    tjruwase committed 4 years ago
  • Remove deepspeed checkpoint utils
    tjruwase committed 4 years ago
  • Use DS 3D reshaping utils
    tjruwase committed 4 years ago
  • sync layer norms
    stas00 committed 4 years ago
  • all_reduce is an in_place operation
    thomasw21 committed 4 years ago
  • Make dataloader use another random generator (#276)
    thomasw21 committed 4 years ago
  • do all_reduce op.AVG directly
    stas00 committed 4 years ago
  • Merge remote-tracking branch 'origin/main' into layer-norm-auto-sync
    thomasw21 committed 3 years ago
  • add eval dataloader deadlock workaround
    stas00 committed 3 years ago
  • Merge remote-tracking branch 'origin/main' into ds_ckpt_reshape
    stas00 committed 3 years ago
  • convert to bf16
    stas00 committed 3 years ago
  • wip universal chkpt
    stas00 committed 3 years ago
  • rename
    stas00 committed 3 years ago
  • rename
    stas00 committed 3 years ago
  • revert generator sync
    stas00 committed 3 years ago
  • wip on fragments dealing
    stas00 committed 3 years ago
  • cleanup
    stas00 committed 3 years ago
  • Loading universal checkpoint with reshaping
    tjruwase committed 3 years ago
  • all gpu1<->2 reshapes work
    stas00 committed 3 years ago
  • param attrs
    tjruwase committed 3 years ago
  • make the tests adaptable to the number of available gpus
    stas00 committed 3 years ago
  • WIP
    tjruwase committed 3 years ago
  • WIP
    tjruwase committed 3 years ago
  • WIP
    tjruwase committed 3 years ago
  • WIP
    tjruwase committed 3 years ago
  • Merge remote-tracking branch 'origin/main' into layer-norm-auto-sync
    stas00 committed 3 years ago
  • Debug functions
    tjruwase committed 3 years ago
  • args should be required, don't create another latest file
    stas00 committed 3 years ago
  • Parallelize shard extraction
    tjruwase committed 3 years ago
  • close+join pool; add tqdm; comment out noise
    stas00 committed 3 years ago
  • rename
    stas00 committed 3 years ago
  • parameterize
    stas00 committed 3 years ago
  • Parallel slice merging
    tjruwase committed 3 years ago
  • Cleanup
    tjruwase committed 3 years ago
  • Merge remote-tracking branch 'origin/main' into ds_ckpt_reshape
    stas00 committed 3 years ago
  • Merge remote-tracking branch 'origin/main' into layer-norm-auto-sync
    stas00 committed 3 years ago
  • allow inspection on a machine w/o gpus
    stas00 committed 3 years ago
  • Merge branch 'layer-norm-auto-sync' into ds_ckpt_reshape-with-layer-norm-auto-sync
    stas00 committed 3 years ago