Elastic training support #602
Starting to add config modifications. Currently in incomplete state
a2689d94
Adding the core elasticity compatible gpu count generation logic
cc54b0f8
Reverting some of the unfinished modifications to get the file workin…
3ee0bdd5
formatting and fix build error
b16518f0
add np req and move elasticity
17168589
update github actions to trigger on all branches
6e7896a6
fix syntax error
072ace3b
exclude docs
64b6ef17
formatting
56ec5130
config restructure, versioning, etc
a6039705
config updates, sanity checks, etc.
fbbd94db
fix version issue
78fd37a8
choose best micro batch size for given world size
ca94dc8c
bug fixes
bdf34150
add unit test
5391541b
add several unit tests and clean-up code
86916423
fix install issue when installing on non-gpu machines
805a0678
Merge branch 'master' into jeffra/elastic
cd44debe
Merge branch 'master' into jeffra/elastic
3275d9ca
Merge branch 'master' into jeffra/elastic
d64f6317
Merge branch 'master' into jeffra/elastic
cbf90631
add ds_elastic cli
07caa68e
clean-up
b4f6d713
formatting
8b784ce4
docstring
dd309928
fix mbsize division issue
c925a534
formatting
16f9aa2d
checkpoint load latest only if it exists
80a642fe
add get_batch_info to engine, assert non-elastic bsz config, fix test
01ee3a4f
fix tests
c6a23c1f
validate elastic config wrt scheduler config, add repr
2e6b35f0
add unit test and fixes
1af4330f
require max-batch and micro-batches for elastic training
d0305834
fix test error
6b6235ba
jeffra marked this pull request as ready for review 5 years ago
jeffra merged 81aeea36 into master 5 years ago
mrwyattii deleted the jeffra/elastic branch 2 years ago
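Several commits above mention generating elasticity-compatible GPU counts and choosing the best micro batch size for a given world size. The following is a minimal sketch of that idea, not the code merged in this PR. It assumes a global batch size must equal micro batch size × gradient accumulation steps × GPU count, and every function name here (`compatible_gpu_counts`, `best_batch_size`, `best_micro_batch`) is hypothetical.

```python
def compatible_gpu_counts(batch_size, micro_batch_sizes, min_gpus=1, max_gpus=None):
    """Return the sorted GPU counts g for which
    batch_size == micro * grad_accum_steps * g holds for some allowed
    micro batch size and some integer grad_accum_steps >= 1."""
    if max_gpus is None:
        max_gpus = batch_size
    counts = set()
    for micro in micro_batch_sizes:
        if batch_size % micro != 0:
            continue
        # slots = gpu_count * grad_accum_steps
        slots = batch_size // micro
        for g in range(1, slots + 1):
            if slots % g == 0 and min_gpus <= g <= max_gpus:
                counts.add(g)
    return sorted(counts)


def best_batch_size(max_batch_size, micro_batch_sizes, min_gpus=1, max_gpus=None):
    """Brute-force search for the batch size <= max_batch_size that is valid
    on the most GPU counts. Ties go to the larger batch size."""
    best = (0, [])
    for bs in range(min(micro_batch_sizes), max_batch_size + 1):
        counts = compatible_gpu_counts(bs, micro_batch_sizes, min_gpus, max_gpus)
        if counts and len(counts) >= len(best[1]):
            best = (bs, counts)
    return best


def best_micro_batch(batch_size, micro_batch_sizes, world_size):
    """Pick the largest allowed micro batch size that evenly divides the
    per-GPU share of the batch at the given world size."""
    if batch_size % world_size != 0:
        raise ValueError("batch size is not divisible by world size")
    per_gpu = batch_size // world_size
    candidates = [m for m in micro_batch_sizes if per_gpu % m == 0]
    if not candidates:
        raise ValueError("no micro batch size fits this world size")
    return max(candidates)


if __name__ == "__main__":
    batch, gpus = best_batch_size(12, [2, 3])
    print(batch, gpus)
    print(best_micro_batch(batch, [2, 3], world_size=2))
```

The search is exhaustive to keep the sketch short; a real implementation would likely restrict the candidate batch sizes rather than scan every integer up to the maximum.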