DeepSpeed
Elastic Training support in DeepSpeed
#2153
Merged

awan-10 merged 30 commits into staging-ft-elastic-v1 from arpan/elasticity
aj-prime
proof of concept for elastic training using pytorch
c5ce9d82
Add command line options for elastic training
5635052a
Remove functionAgent
9b5e72cf
Add NCCL BLOCKING ERROR flag to elastic training
4881ce8f
transient change
9f1c997b
Added DS elastic agent
183b6bf1
Cleanup
4d024f07
pass environment variables to worker processes
1ab31049
Enable elastic checkpoint for scale down in elastic training
f55b767c
added detection of master addr and port on rank 0
e37c7619
fixed formatting
a41757d0
Merge latest master to elasticity branch
95f70ce6
add launch.py and elastic_agent.py files to skip list in torchdist check
f6375690
add pytorch dependency for elastic training
8aafc3a5
add function for checking pytorch version
b0a8802d
added kill command for pdsh when SIGINT is received
96d678bd
re-enable elastic checkpoint assertion
3ac51197
aj-prime Merge branch 'staging-ft-elastic-v1' into arpan/elasticity
a23594ee
Add support for variable batch size
4f9c5351
Fix elasticity V2, enable pipeline parallelism in Elastic Training, a…
d995fb3b
updated elastic unit test
706ebcec
added an assertion for model-parallel support and added code to prote…
9063a948
modified elastic training unit test, added config options for elastic…
f4ace715
aj-prime requested a review from jeffra 3 years ago
aj-prime requested a review from samyam 3 years ago
aj-prime requested a review from tjruwase 3 years ago
aj-prime requested a review from ShadenSmith 3 years ago
aj-prime requested a review from conglongli 3 years ago
aj-prime requested a review from awan-10 3 years ago
aj-prime requested a review from cli99 3 years ago
aj-prime requested a review from eltonzheng 3 years ago
aj-prime requested a review from minjiaz 3 years ago
aj-prime requested a review from RezaYazdaniAminabadi 3 years ago
aj-prime requested a review from duli2012 3 years ago
aj-prime requested a review from mrwyattii 3 years ago
aj-prime requested a review from yaozhewei 3 years ago
aj-prime requested a review from arashb 3 years ago
aj-prime requested a review from xiaoxiawu-microsoft 3 years ago
aj-prime requested a review from samadejacobs 3 years ago
resolved conflicts
bb4a7f3d
fixed a typo
6ed80660
fixed test_elastic
f2405bd3
removed extra imports
66205a12
tjruwase commented on 2022-07-29
tjruwase commented on 2022-07-29
renamed min and max nodes arguments
7e601b3a
use default elastic ID
6de2cd87
jeffra commented on 2022-07-29
expose elastic run id as an env variable
ae26b52a
jeffra approved these changes on 2022-07-29
awan-10 merged 63ae1c5e into staging-ft-elastic-v1 3 years ago
mrwyattii deleted the arpan/elasticity branch 2 years ago
