DeepSpeed
Elastic Training support in DeepSpeed
#2153
Merged
awan-10
merged 30 commits into
staging-ft-elastic-v1
from
arpan/elasticity
proof of concept for elastic training using pytorch
c5ce9d82
Add command line options for elastic training
5635052a
Remove functionAgent
9b5e72cf
Add NCCL BLOCKING ERROR flag to elastic training
4881ce8f
transient change
9f1c997b
Added DS elastic agent
183b6bf1
Cleanup
4d024f07
pass environment variables to worker processes
1ab31049
Enable elastic checkpoint for scale down in elastic training
f55b767c
added detection of master addr and port on rank 0
e37c7619
fixed formatting
a41757d0
Merge latest master to elasticity branch
95f70ce6
add launch.py and elastic_agent.py files to skip list in torchdist check
f6375690
add pytorch dependency for elastic training
8aafc3a5
add function for checking pytorch version
b0a8802d
added kill command for pdsh when SIGINT is received
96d678bd
re-enable elastic checkpoint assertion
3ac51197
Merge branch 'staging-ft-elastic-v1' into arpan/elasticity
a23594ee
Add support for variable batch size
4f9c5351
Fix elasticity V2, enable pipeline parallelism in Elastic Training, a…
d995fb3b
updated elastic unit test
706ebcec
added an assertion for moded-parallel support and added code to prote…
9063a948
modified elastic training unit test, added config options for elastic…
f4ace715
aj-prime
requested a review
from
jeffra, samyam, tjruwase, ShadenSmith, conglongli, awan-10, cli99, eltonzheng, minjiaz, RezaYazdaniAminabadi, duli2012, mrwyattii, yaozhewei, arashb, xiaoxiawu-microsoft, samadejacobs
3 years ago
resolved conflicts
bb4a7f3d
fixed a typo
6ed80660
fixed test_elastic
f2405bd3
removed extra imports
66205a12
tjruwase
commented on 2022-07-29
tjruwase
commented on 2022-07-29
renamed min and max nodes arguments
7e601b3a
use default elastic ID
6de2cd87
jeffra
commented on 2022-07-29
expose elastic run id as an env variable
ae26b52a
jeffra
approved these changes on 2022-07-29
awan-10
merged
63ae1c5e
into staging-ft-elastic-v1
3 years ago
mrwyattii
deleted the arpan/elasticity branch
2 years ago
Reviewers
jeffra
tjruwase
samyam
ShadenSmith
conglongli
awan-10
cli99
eltonzheng
minjiaz
RezaYazdaniAminabadi
duli2012
mrwyattii
yaozhewei
arashb
xiaoxiawu-microsoft
samadejacobs
Assignees
No one assigned
Labels
None yet
Milestone
No milestone