DeepSpeed
Elastic training support
#602
Merged

Commits
  • Starting to add config modifications. Currently in incomplete state
    jeffra committed 5 years ago
  • Adding the core elasticity compatible gpu count generation logic
    jeffra committed 5 years ago
  • Reverting some of the unfinished modifications to get the file working as standalone
    jeffra committed 5 years ago
  • formatting and fix build error
    jeffra committed 5 years ago
  • add np req and move elasticity
    jeffra committed 5 years ago
  • update github actions to trigger on all branches
    jeffra committed 5 years ago
  • fix syntax error
    jeffra committed 5 years ago
  • exclude docs
    jeffra committed 5 years ago
  • formatting
    jeffra committed 5 years ago
  • config restructure, versioning, etc
    jeffra committed 5 years ago
  • config updates, sanity checks, etc.
    jeffra committed 5 years ago
  • fix version issue
    jeffra committed 5 years ago
  • choose best micro batch size for given world size
    jeffra committed 5 years ago
  • bug fixes
    jeffra committed 5 years ago
  • add unit test
    jeffra committed 5 years ago
  • add several unit tests and clean-up code
    jeffra committed 5 years ago
  • fix install issue when installing on non-gpu machines
    jeffra committed 5 years ago
  • Merge branch 'master' into jeffra/elastic
    jeffra committed 5 years ago
  • Merge branch 'master' into jeffra/elastic
    jeffra committed 5 years ago
  • Merge branch 'master' into jeffra/elastic
    jeffra committed 5 years ago
  • Merge branch 'master' into jeffra/elastic
    jeffra committed 5 years ago
  • add ds_elastic cli
    jeffra committed 5 years ago
  • clean-up
    jeffra committed 5 years ago
  • formatting
    jeffra committed 5 years ago
  • docstring
    jeffra committed 5 years ago
  • fix mbsize division issue
    jeffra committed 5 years ago
  • formatting
    jeffra committed 5 years ago
  • checkpoint load latest only if it exists
    jeffra committed 5 years ago
  • add get_batch_info to engine, assert non-elastic bsz config, fix test
    jeffra committed 5 years ago
  • fix tests
    jeffra committed 5 years ago
  • validate elastic config wrt scheduler config, add repr
    jeffra committed 5 years ago
  • add unit test and fixes
    jeffra committed 5 years ago
  • require max-batch and micro-batches for elastic training
    jeffra committed 5 years ago
  • fix test error
    jeffra committed 5 years ago