DeepSpeed
Various ZeRO Stage3 Optimizations + Improvements (including bfloat16 support)
#1453
Merged

tjruwase merged 91 commits into deepspeedai:master from jfc4050:s3-pr
jfc4050
raamjad Changes for bfloat16 Zero2
fe264232
jfc4050 ZeRO stage3 optimizations, with some bug fixes
8864f911
jfc4050 requested a review from awan-10 4 years ago
jfc4050 requested a review from cli99 4 years ago
jfc4050 requested a review from conglongli 4 years ago
jfc4050 requested a review from eltonzheng 4 years ago
jfc4050 requested a review from jeffra 4 years ago
jfc4050 requested a review from minjiaz 4 years ago
jfc4050 requested a review from niumanar 4 years ago
jfc4050 requested a review from RezaYazdaniAminabadi 4 years ago
jfc4050 requested a review from samyam 4 years ago
jfc4050 requested a review from ShadenSmith 4 years ago
jfc4050 requested a review from tjruwase 4 years ago
ghost
tjruwase
tjruwase
tjruwase
jfc4050
jfc4050 fix import in ut
e66aedc2
jfc4050 ran yapf
350a7a02
tjruwase
stas00 commented on 2021-10-12
stas00
jfc4050
tjruwase
jfc4050
zarzen commented on 2021-10-13
tjruwase Merge branch 'master' into s3-pr
b37a4f01
jfc4050 improvements to cache flush warn log
f3839476
jfc4050 backwards compatibility with older versions of pytorch
b2a1c954
jfc4050 handle edge case where reduced tensor smaller than world size
d8678fa7
jfc4050 moved event synchronization to allgather handle wait() call
a0faca0b
jfc4050 removed unnecessary barrier call
bf20c90c
jfc4050 Merge branch 'master' into s3-pr
a353017c
zarzen commented on 2021-10-14
jfc4050 formatting fix after resolving merge conflict
c51ba461
jfc4050 skip nvme prefetch when trace not complete
ff01f5cc
jfc4050 opportunistically avoid memory allocation in allgather coalesced wher…
13093eb8
tjruwase Merge branch 'master' into s3-pr
3cdcbdf8
tjruwase
tjruwase Merge branch 'master' into s3-pr
64d74d1e
tjruwase
zarzen
tjruwase
zarzen
jfc4050
tjruwase Merge branch 'master' into s3-pr
e30e6cca
tjruwase
zarzen
jfc4050 fix indentation after merge
f19593d6
jfc4050 fixes to account for parameter offload
f72bc78a
jfc4050 accounting for torch.cuda.memory_stats not being available
660df05b
jfc4050 moved partition_all_params to optimizer step
4f9477f8
raamjad
raamjad
jeffra Merge branch 'master' into s3-pr
818651c6
tjruwase
jfc4050 Merge branch 'master' into s3-pr
f681201b
jfc4050 allgathering on params before item gets called
bb34f901
jfc4050 fix param status checks
9f3b5043
jfc4050 fix grad accumulation with optimizer offload
1772d410
jfc4050 grad norm computation fix for optimizer offload
5f213d8c
jfc4050 change post divide in reduce-scatter to pre divide
31988054
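The commit title above does not say why the division moved. One common motivation, assumed here rather than taken from the PR, is that averaging by dividing after the sum can overflow in half precision, while dividing each rank's contribution first keeps the running sum in range. A minimal sketch that simulates fp16 rounding with the standard library (`to_fp16`, `post_divide`, and `pre_divide` are hypothetical names, not DeepSpeed functions):

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision; treat out-of-range values as +/-inf."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def post_divide(grads):
    """Sum every rank's gradient in fp16, then divide by the world size."""
    total = 0.0
    for g in grads:
        total = to_fp16(total + to_fp16(g))
    return to_fp16(total / len(grads))


def pre_divide(grads):
    """Divide every rank's gradient by the world size, then sum in fp16."""
    total = 0.0
    for g in grads:
        total = to_fp16(total + to_fp16(g / len(grads)))
    return total


# Four simulated ranks, each holding the same large gradient value.
rank_grads = [32768.0] * 4
print(post_divide(rank_grads))  # the fp16 running sum overflows
print(pre_divide(rank_grads))   # each term is scaled first, so the sum stays finite
```

The sketch only models the arithmetic on a single value; a real reduce-scatter performs the sum across ranks inside the collective.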
jfc4050 fix gradient race condition w/ optimizer offload
2225659c
jfc4050 improve inf/nan gradient tracking
5aa9bd50
jfc4050 don't prefetch when not in training mode
a1a60ed4
jfc4050 format fix after merging
df416593
raamjad
jfc4050
tjruwase
tjruwase
jfc4050 fix prefetching issue when using NVME offload
ab3a82af
jfc4050
stas00
jfc4050
tjruwase Merge branch 'master' into s3-pr
025a41e6
tjruwase
tjruwase
jfc4050 Merge branch 'master' into s3-pr
6f9415b9
tjruwase
szhengac commented on 2021-11-02
jfc4050 Merge branch 'master' into s3-pr
8d122812
jfc4050 improved defragmentation for fp16 parameters
a26d1fb9
jfc4050 relative imports for bf16 tests
937f04e1
jfc4050 changes for bwd compatibility with pytorch 1.2
e74f5099
jfc4050 remove buffered_reduce_fallback
6ee558d1
jfc4050 removed unused parameter offset bookkeeping
14e22a25
jfc4050 fixed tracking for multiple param groups
16281df2
tjruwase Merge branch 'master' into s3-pr
38af6b18
jfc4050 unbroke bfloat16 config after merge conflict
cc7011ec
jfc4050 using base allgather params when only 1 param
806b0726
jfc4050 cleanup/fixes for fp16 partition defragmentation
bf0dd663
manuelciosici commented on 2021-11-04
jfc4050
tjruwase Merge branch 'master' into s3-pr
73207aee
tjruwase commented on 2021-11-05
tjruwase Merge branch 'master' into s3-pr
d3ecb1fc
tjruwase
tjruwase
jfc4050
tjruwase Merge branch 'master' into s3-pr
812fe679
jfc4050
jfc4050
tjruwase
tjruwase
jeffra switch to CRLF
6dc21a60
jeffra
jeffra convert to same new-line style as master
2a383020
jeffra align new line with master
16f1d21d
bazpasha commented on 2021-11-19
bazpasha commented on 2021-11-19
tjruwase Merge branch 'master' into s3-pr
11d590a0
tjruwase Fix merge issues
2b5f6ea2
tjruwase commented on 2021-11-23
tjruwase commented on 2021-11-23
tjruwase commented on 2021-11-23
tjruwase commented on 2021-11-23
tjruwase commented on 2021-11-23
tjruwase commented on 2021-11-24
tjruwase commented on 2021-11-24
tjruwase commented on 2021-11-24
tjruwase Merge branch 'master' into s3-pr
80b53d31
tjruwase Merge branch 'master' into s3-pr
6dfe6938
jeffra switch to CRLF
912e6f04
tjruwase commented on 2021-11-29
jeffra fix to LF line endings
4b0133b6
jeffra minor merge fixes
b998206e
jfc4050 remove extra bfloat16_enabled definition
d6deecb3
jfc4050 asserting params inflight for AllGatherHandle
2a4ef29a
jfc4050 remove get_cuda_mem_allocated_str
90182b66
stas00
stas00
stas00
jfc4050
stas00
jfc4050
tjruwase Merge branch 'master' into s3-pr
ad847edb
tjruwase Format fixes
f590ba45
tjruwase commented on 2021-12-08
jfc4050 fix bfloat16 zero stage check (broken after merge commit)
9db815fe
jfc4050
tjruwase
tjruwase +self.communication_data_type, -self.allreduce_always_fp32; delete de…
259ec153
jfc4050
tjruwase Add self.reduce_scatter
96d22471
tjruwase Merge branch 'master' into s3-pr
2630b756
stas00
stas00
tjruwase Merge branch 'master' into s3-pr
79fd42cd
jeffra Merge branch 'master' into s3-pr
8565e043
jeffra
stas00
jeffra
stas00
tjruwase Merge branch 'master' into s3-pr
06eab1ac
tjruwase Format fix
0f8affe3
tjruwase Merge branch 'master' into s3-pr
3436422e
tjruwase Fix merge issues
601d1f19
tjruwase Merge branch 's3-pr' of github.com:jfc4050/DeepSpeed into s3-pr
5dcee36b
tjruwase Merge branch 'master' into s3-pr
580d25ef
jfc4050
jeffra Merge branch 'master' into s3-pr
872f4513
tjruwase Merge branch 'master' into s3-pr
e236293b
tjruwase commented on 2022-01-11
tjruwase commented on 2022-01-11
tjruwase commented on 2022-01-11
tjruwase Merge branch 'master' into s3-pr
43b3b83b
tjruwase
tjruwase commented on 2022-01-11
tjruwase commented on 2022-01-11
tjruwase commented on 2022-01-11
tjruwase commented on 2022-01-11
tjruwase commented on 2022-01-11
tjruwase commented on 2022-01-11
tjruwase requested changes on 2022-01-11
jfc4050
tjruwase Merge branch 'master' into s3-pr
83905ac1
jfc4050
jfc4050 iterate over params_to_fetch rather than make another iterator
31aecfca
jfc4050 add some TODOs
8736700e
tjruwase Merge branch 'master' into s3-pr
516379de
jfc4050 remove unnecessary division by micro_step_id
0bf7bcde
jfc4050 rename config keys "bfloat16" -> "bf16"
43c00ff7
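As an illustration of the renamed key, a config dictionary written after this commit would enable the feature under `bf16` instead of `bfloat16`; a later commit in this timeline adds a test for backward compatibility of the key. The reader function below is a hypothetical sketch of how a caller might accept both spellings. It is not DeepSpeed's own parsing code, and the `enabled` sub-key is an assumption about the section layout.

```python
def bf16_enabled(config: dict) -> bool:
    """Hypothetical helper: prefer the new "bf16" key, fall back to "bfloat16"."""
    for key in ("bf16", "bfloat16"):
        section = config.get(key)
        if isinstance(section, dict):
            return bool(section.get("enabled", False))
    return False


# New spelling introduced by the rename.
new_style = {"bf16": {"enabled": True}}
# Old spelling, assumed to remain accepted for backward compatibility.
legacy_style = {"bfloat16": {"enabled": True}}

print(bf16_enabled(new_style), bf16_enabled(legacy_style), bf16_enabled({}))
```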
jfc4050 rename stage3_gather_fp16_weights_on_model_save -> stage3_gather_16bi…
4574bc71
jfc4050 force pushed from 55924bc6 to 4574bc71 4 years ago
stas00 commented on 2022-01-19
jfc4050 add unit test to check backwards compatibility for gather_16bit_weights
e04dc6a2
jfc4050 added test to confirm bf16 key bwd compatibility
391cecf7
jfc4050
stas00
tjruwase Merge branch 'master' into s3-pr
3d264694
tjruwase Format fixes
536d1718
tjruwase approved these changes on 2022-01-19
tjruwase
jfc4050
tjruwase Merge branch 'master' into s3-pr
19f35382
tjruwase
tjruwase merged 4912e0ad into master 4 years ago
stas00
bliu3650 commented on 2023-05-12
