DeepSpeed
Elastic training support
#602
Merged

jeffra merged 34 commits into master from jeffra/elastic
samyam Starting to add config modifications. Currently in incomplete state
a2689d94
samyam Adding the core elasticity compatible gpu count generation logic
cc54b0f8
samyam Reverting some of the unfinished modifications to get the file workin…
3ee0bdd5
jeffra formatting and fix build error
b16518f0
jeffra add np req and move elasticity
17168589
jeffra update github actions to trigger on all branches
6e7896a6
jeffra fix syntax error
072ace3b
jeffra exclude docs
64b6ef17
jeffra formatting
56ec5130
jeffra config restructure, versioning, etc
a6039705
jeffra config updates, sanity checks, etc.
fbbd94db
jeffra fix version issue
78fd37a8
jeffra choose best micro batch size for given world size
ca94dc8c
jeffra bug fixes
bdf34150
jeffra add unit test
5391541b
jeffra add several unit tests and clean-up code
86916423
jeffra fix install issue when installing on non-gpu machines
805a0678
jeffra Merge branch 'master' into jeffra/elastic
cd44debe
jeffra Merge branch 'master' into jeffra/elastic
3275d9ca
jeffra Merge branch 'master' into jeffra/elastic
d64f6317
jeffra Merge branch 'master' into jeffra/elastic
cbf90631
jeffra add ds_elastic cli
07caa68e
jeffra clean-up
b4f6d713
jeffra formatting
8b784ce4
jeffra docstring
dd309928
jeffra fix mbsize division issue
c925a534
jeffra formatting
16f9aa2d
jeffra checkpoint load latest only if it exists
80a642fe
jeffra add get_batch_info to engine, assert non-elastic bsz config, fix test
01ee3a4f
jeffra fix tests
c6a23c1f
jeffra validate elastic config wrt scheduler config, add repr
2e6b35f0
jeffra add unit test and fixes
1af4330f
jeffra require max-batch and micro-batches for elastic training
d0305834
jeffra fix test error
6b6235ba
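
The commits above mention generating elasticity-compatible GPU counts, choosing a micro batch size for a given world size, and requiring a max batch size and a list of micro batch sizes. As a rough illustration of that idea only (not the code merged in this PR), the sketch below assumes a global batch size must factor as `micro_batch * grad_accum_steps * num_gpus`. The function names, the `min_gpus`/`max_gpus` parameters, and the "largest micro batch is best" rule are all hypothetical.

```python
def compatible_gpu_counts(batch_size, micro_batches, min_gpus=1, max_gpus=64):
    """Return the sorted GPU counts n in [min_gpus, max_gpus] for which some
    micro batch size mb satisfies batch_size == mb * grad_accum_steps * n
    with an integer grad_accum_steps. (Hypothetical helper.)"""
    valid = set()
    for mb in micro_batches:
        if batch_size % mb != 0:
            continue  # this micro batch size cannot tile the global batch
        micro_steps = batch_size // mb  # total micro batches per global step
        for n in range(min_gpus, max_gpus + 1):
            if micro_steps % n == 0:
                valid.add(n)
    return sorted(valid)


def best_micro_batch(batch_size, micro_batches, world_size):
    """Pick a micro batch size usable at world_size, or None if none fits.
    Assumption: "best" means the largest usable micro batch size."""
    usable = [mb for mb in micro_batches if batch_size % (mb * world_size) == 0]
    return max(usable) if usable else None


if __name__ == "__main__":
    print(compatible_gpu_counts(24, [2, 3], max_gpus=16))
    print(best_micro_batch(24, [2, 3], world_size=6))
```

With a fixed global batch size, a job can then be resized to any GPU count in the returned list without changing the effective batch size; only the micro batch size and the number of gradient accumulation steps change.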
jeffra marked this pull request as ready for review 5 years ago
jeffra requested a review from arashashari 5 years ago
jeffra requested a review from awan-10 5 years ago
jeffra requested a review from cli99 5 years ago
jeffra requested a review from conglongli 5 years ago
jeffra requested a review from eltonzheng 5 years ago
jeffra requested a review from minjiaz 5 years ago
jeffra requested a review from niumanar 5 years ago
jeffra requested a review from RezaYazdaniAminabadi 5 years ago
jeffra requested a review from samyam 5 years ago
jeffra requested a review from ShadenSmith 5 years ago
jeffra requested a review from tjruwase 5 years ago
jeffra merged 81aeea36 into master 5 years ago
g-karthik
mrwyattii deleted the jeffra/elastic branch 2 years ago
