DeepSpeed
parallelize writing of layer checkpoint files across data parallel instances
#1419
Merged

tjruwase merged 13 commits into deepspeedai:master from adammoody:layerckpt
adammoody requested a review from awan-10 4 years ago
adammoody requested a review from cli99 4 years ago
adammoody requested a review from conglongli 4 years ago
adammoody requested a review from eltonzheng 4 years ago
adammoody requested a review from jeffra 4 years ago
adammoody requested a review from minjiaz 4 years ago
adammoody requested a review from niumanar 4 years ago
adammoody requested a review from RezaYazdaniAminabadi 4 years ago
adammoody requested a review from samyam 4 years ago
adammoody requested a review from ShadenSmith 4 years ago
adammoody requested a review from tjruwase 4 years ago
adammoody force pushed from 1ac98950 to 9fbeb42f 4 years ago
adammoody changed the title from "WIP: parallelize layer checkpoints across data parallel instances" to "parallelize writing of layer checkpoint files across data parallel instances" 4 years ago
adammoody force pushed from 9fbeb42f to 1cee52dd 3 years ago
tjruwase commented on 2022-09-20
adammoody parallelize layer checkpoints across data parallel groups
8fef9f6c
adammoody force pushed from 1cee52dd to 8fef9f6c 3 years ago
adammoody requested a review from duli2012 3 years ago
adammoody requested a review from mrwyattii 3 years ago
adammoody requested a review from yaozhewei 3 years ago
adammoody requested a review from arashb 3 years ago
adammoody requested a review from xiaoxiawu-microsoft 3 years ago
adammoody requested a review from samadejacobs 3 years ago
adammoody requested a review from cmikeh2 3 years ago
adammoody requested a review from GuanhuaWang 3 years ago
adammoody commented on 2022-09-20
adammoody use partition_uniform to determine start/end index values
c64a7d4e
adammoody formatting fix
6f8c9d1e
tjruwase Merge branch 'master' into layerckpt
fa99397f
tjruwase Merge branch 'master' into layerckpt
f05dc913
adammoody config: add option for parallel write of layer checkpoints in pipelin…
ed8bc48e
adammoody force pushed from e6a45fd6 to ed8bc48e 3 years ago
adammoody yapf fixes
92f6a840
adammoody enable parallel layer write according to config param
2dbf0a4f
adammoody avoid extraneous makedir when rank 0 writes all layers
2f311e99
tjruwase Merge branch 'master' into layerckpt
27002cfa
tjruwase approved these changes on 2022-10-10
tjruwase Merge branch 'master' into layerckpt
f54324a5
tjruwase Merge branch 'master' into layerckpt
6d5518b3
tjruwase Merge branch 'master' into layerckpt
0e4d92b7
tjruwase merged b8fb9c3f into master 3 years ago
adammoody deleted the layerckpt branch 3 years ago