DeepSpeed
Various ZeRO Stage3 Optimizations + Improvements (including bfloat16 support)
#1453
Merged

tjruwase merged 91 commits into deepspeedai:master from jfc4050:s3-pr
jfc4050
raamjad Changes for bfloat16 Zero2
fe264232
ZeRO stage3 optimizations, with some bug fixes
8864f911
jfc4050 requested a review from awan-10 4 years ago
jfc4050 requested a review from cli99 4 years ago
jfc4050 requested a review from conglongli 4 years ago
jfc4050 requested a review from eltonzheng 4 years ago
jfc4050 requested a review from jeffra 4 years ago
jfc4050 requested a review from minjiaz 4 years ago
jfc4050 requested a review from niumanar 4 years ago
jfc4050 requested a review from RezaYazdaniAminabadi 4 years ago
jfc4050 requested a review from samyam 4 years ago
jfc4050 requested a review from ShadenSmith 4 years ago
jfc4050 requested a review from tjruwase 4 years ago
ghost
tjruwase
tjruwase
tjruwase
jfc4050
fix import in ut
e66aedc2
ran yapf
350a7a02
tjruwase
stas00 commented on 2021-10-12
stas00
jfc4050
tjruwase
jfc4050
zarzen commented on 2021-10-13
tjruwase Merge branch 'master' into s3-pr
b37a4f01
improvements to cache flush warn log
f3839476
backwards compatibility with older versions of pytorch
b2a1c954
handle edge case where reduced tensor smaller than world size
d8678fa7
moved event synchronization to allgather handle wait() call
a0faca0b
removed unnecessary barrier call
bf20c90c
jfc4050 Merge branch 'master' into s3-pr
a353017c
zarzen commented on 2021-10-14
formatting fix after resolving merge conflict
c51ba461
skip nvme prefetch when trace not complete
ff01f5cc
opportunistically avoid memory allocation in allgather coalesced wher…
13093eb8
tjruwase Merge branch 'master' into s3-pr
3cdcbdf8
tjruwase
tjruwase Merge branch 'master' into s3-pr
64d74d1e
tjruwase
zarzen
tjruwase
zarzen
jfc4050
tjruwase Merge branch 'master' into s3-pr
e30e6cca
tjruwase
zarzen
fix indentation after merge
f19593d6
fixes to account for parameter offload
f72bc78a
accounting for torch.cuda.memory_stats not being available
660df05b
moved partition_all_params to optimizer step
4f9477f8
raamjad
raamjad
jeffra Merge branch 'master' into s3-pr
818651c6
tjruwase
jfc4050 Merge branch 'master' into s3-pr
f681201b
allgathering on params before item gets called
bb34f901
fix param status checks
9f3b5043
fix grad accumulation with optimizer offload
1772d410
grad norm computation fix for optimizer offload
5f213d8c
change post divide in reduce-scatter to pre divide
31988054
fix gradient race condition w/ optimizer offload
2225659c
improve inf/nan gradient tracking
5aa9bd50
don't prefetch when not in training mode
a1a60ed4
format fix after merging
df416593
raamjad
jfc4050
tjruwase
tjruwase
fix prefetching issue when using NVME offload
ab3a82af
jfc4050
stas00
jfc4050
tjruwase Merge branch 'master' into s3-pr
025a41e6
tjruwase
tjruwase
jfc4050 Merge branch 'master' into s3-pr
6f9415b9
tjruwase
szhengac commented on 2021-11-02
jfc4050 Merge branch 'master' into s3-pr
8d122812
improved defragmentation for fp16 parameters
a26d1fb9
relative imports for bf16 tests
937f04e1
changes for bwd compatibility with pytorch 1.2
e74f5099
remove buffered_reduce_fallback
6ee558d1
removed unused parameter offset bookkeeping
14e22a25
fixed tracking for multiple param groups
16281df2
tjruwase Merge branch 'master' into s3-pr
38af6b18
unbroke bfloat16 config after merge conflict
cc7011ec
using base allgather params when only 1 param
806b0726
cleanup/fixes for fp16 partition defragmentation
bf0dd663
manuelciosici commented on 2021-11-04
jfc4050
tjruwase Merge branch 'master' into s3-pr
73207aee
tjruwase commented on 2021-11-05
tjruwase Merge branch 'master' into s3-pr
d3ecb1fc
tjruwase
tjruwase
jfc4050
tjruwase Merge branch 'master' into s3-pr
812fe679
jfc4050
jfc4050
tjruwase
tjruwase
jeffra switch to CRLF
6dc21a60
jeffra
jeffra convert to same new-line style as master
2a383020
jeffra align new line with master
16f1d21d
bazpasha commented on 2021-11-19
bazpasha commented on 2021-11-19
tjruwase Merge branch 'master' into s3-pr
11d590a0
tjruwase Fix merge issues
2b5f6ea2
tjruwase commented on 2021-11-23
tjruwase commented on 2021-11-23
tjruwase commented on 2021-11-23
tjruwase commented on 2021-11-23
tjruwase commented on 2021-11-23
tjruwase commented on 2021-11-24
tjruwase commented on 2021-11-24
tjruwase commented on 2021-11-24
tjruwase Merge branch 'master' into s3-pr
80b53d31
tjruwase Merge branch 'master' into s3-pr
6dfe6938
jeffra switch to CRLF
912e6f04
tjruwase commented on 2021-11-29
jeffra fix to LF line endings
4b0133b6
jeffra minor merge fixes
b998206e
remove extra bfloat16_enabled definition
d6deecb3
asserting params inflight for AllGatherHandle
2a4ef29a
remove get_cuda_mem_allocated_str
90182b66
stas00
stas00
stas00
jfc4050
stas00
jfc4050
tjruwase Merge branch 'master' into s3-pr
ad847edb
tjruwase Format fixes
f590ba45
tjruwase commented on 2021-12-08
fix bfloat16 zero stage check (broken after merge commit)
9db815fe
jfc4050
tjruwase
tjruwase +self.communication_data_type, -self.allreduce_always_fp32; delete de…
259ec153
jfc4050
tjruwase Add self.reduce_scatter
96d22471
tjruwase Merge branch 'master' into s3-pr
2630b756
stas00
stas00
tjruwase Merge branch 'master' into s3-pr
79fd42cd
jeffra Merge branch 'master' into s3-pr
8565e043
jeffra
stas00
jeffra
stas00
tjruwase Merge branch 'master' into s3-pr
06eab1ac
tjruwase Format fix
0f8affe3
tjruwase Merge branch 'master' into s3-pr
3436422e
tjruwase Fix merge issues
601d1f19
tjruwase Merge branch 's3-pr' of github.com:jfc4050/DeepSpeed into s3-pr
5dcee36b
tjruwase Merge branch 'master' into s3-pr
580d25ef
jfc4050
jeffra Merge branch 'master' into s3-pr
872f4513
tjruwase Merge branch 'master' into s3-pr
e236293b
tjruwase commented on 2022-01-11
tjruwase commented on 2022-01-11
tjruwase commented on 2022-01-11
tjruwase Merge branch 'master' into s3-pr
43b3b83b
tjruwase
tjruwase commented on 2022-01-11
tjruwase commented on 2022-01-11
tjruwase commented on 2022-01-11
tjruwase commented on 2022-01-11
tjruwase commented on 2022-01-11
tjruwase commented on 2022-01-11
tjruwase requested changes on 2022-01-11
jfc4050
tjruwase Merge branch 'master' into s3-pr
83905ac1
jfc4050
iterate over params_to_fetch rather than make another iterator
31aecfca
add some TODOs
8736700e
tjruwase Merge branch 'master' into s3-pr
516379de
remove unnecessary division by micro_step_id
0bf7bcde
rename config keys "bfloat16" -> "bf16"
43c00ff7
rename stage3_gather_fp16_weights_on_model_save -> stage3_gather_16bi…
4574bc71
jfc4050 force pushed from 55924bc6 to 4574bc71 3 years ago
stas00 commented on 2022-01-19
add unit test to check backwards compatibility for gather_16bit_weights
e04dc6a2
added test to confirm bf16 key bwd compatibility
391cecf7
jfc4050
stas00
tjruwase Merge branch 'master' into s3-pr
3d264694
tjruwase Format fixes
536d1718
tjruwase approved these changes on 2022-01-19
tjruwase
jfc4050
tjruwase Merge branch 'master' into s3-pr
19f35382
tjruwase
tjruwase merged 4912e0ad into master 3 years ago
stas00
bliu3650 commented on 2023-05-12