DeepSpeed
Various ZeRO Stage3 Optimizations + Improvements (including bfloat16 support)
#1453
Merged
tjruwase merged 91 commits into deepspeedai:master from jfc4050:s3-pr
Changes for bfloat16 Zero2
fe264232
ZeRO stage3 optimizations, with some bug fixes
8864f911
jfc4050 requested a review from awan-10 4 years ago
jfc4050 requested a review from cli99 4 years ago
jfc4050 requested a review from conglongli 4 years ago
jfc4050 requested a review from eltonzheng 4 years ago
jfc4050 requested a review from jeffra 4 years ago
jfc4050 requested a review from minjiaz 4 years ago
jfc4050 requested a review from niumanar 4 years ago
jfc4050 requested a review from RezaYazdaniAminabadi 4 years ago
jfc4050 requested a review from samyam 4 years ago
jfc4050 requested a review from ShadenSmith 4 years ago
jfc4050 requested a review from tjruwase 4 years ago
fix import in ut
e66aedc2
ran yapf
350a7a02
stas00
commented on 2021-10-12
zarzen
commented on 2021-10-13
Merge branch 'master' into s3-pr
b37a4f01
improvements to cache flush warn log
f3839476
backwards compatibility with older versions of pytorch
b2a1c954
handle edge case where reduced tensor smaller than world size
d8678fa7
moved event synchronization to allgather handle wait() call
a0faca0b
removed unnecessary barrier call
bf20c90c
Merge branch 'master' into s3-pr
a353017c
zarzen
commented on 2021-10-14
formatting fix after resolving merge conflict
c51ba461
skip nvme prefetch when trace not complete
ff01f5cc
opportunistically avoid memory allocation in allgather coalesced wher…
13093eb8
Merge branch 'master' into s3-pr
3cdcbdf8
Merge branch 'master' into s3-pr
64d74d1e
Merge branch 'master' into s3-pr
e30e6cca
fix indentation after merge
f19593d6
fixes to account for parameter offload
f72bc78a
accounting for torch.cuda.memory_stats not being available
660df05b
moved partition_all_params to optimizer step
4f9477f8
Merge branch 'master' into s3-pr
818651c6
Merge branch 'master' into s3-pr
f681201b
allgathering on params before item gets called
bb34f901
fix param status checks
9f3b5043
fix grad accumulation with optimizer offload
1772d410
grad norm computation fix for optimizer offload
5f213d8c
change post divide in reduce-scatter to pre divide
31988054
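The commit title above does not say why the division was moved. One common motivation for dividing before a low-precision reduction is overflow: the sum of the per-rank gradients can exceed the fp16 range even when the average fits. The sketch below illustrates that effect in pure Python by rounding every intermediate to half precision. It is not DeepSpeed's reduce-scatter code; `to_fp16`, `reduce_post_divide`, and `reduce_pre_divide` are hypothetical names, and the gradient values are made up for illustration.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct rejects finite values outside the fp16 range;
        # model that as overflow to +/- infinity.
        return math.copysign(math.inf, x)


def reduce_post_divide(grads):
    """Sum in fp16 first, then divide by the number of ranks."""
    total = 0.0
    for g in grads:
        total = to_fp16(total + to_fp16(g))
    return to_fp16(total / len(grads))


def reduce_pre_divide(grads):
    """Divide each contribution by the number of ranks, then sum in fp16."""
    n = len(grads)
    total = 0.0
    for g in grads:
        total = to_fp16(total + to_fp16(g / n))
    return total


# One made-up gradient value per rank (4 ranks). Each value fits in fp16,
# but the running sum passes the fp16 maximum of 65504.
grads = [40000.0] * 4
post = reduce_post_divide(grads)
pre = reduce_pre_divide(grads)
print("post-divide:", post)
print("pre-divide:", pre)
```

With these inputs the post-divide path overflows while the pre-divide path stays finite. The trade-off is that pre-dividing shrinks each contribution, so very small gradients lose precision sooner.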
fix gradient race condition w/ optimizer offload
2225659c
improve inf/nan gradient tracking
5aa9bd50
don't prefetch when not in training mode
a1a60ed4
format fix after merging
df416593
fix prefetching issue when using NVME offload
ab3a82af
Merge branch 'master' into s3-pr
025a41e6
Merge branch 'master' into s3-pr
6f9415b9
szhengac
commented on 2021-11-02
Merge branch 'master' into s3-pr
8d122812
improved defragmentation for fp16 parameters
a26d1fb9
relative imports for bf16 tests
937f04e1
changes for bwd compatibility with pytorch 1.2
e74f5099
remove buffered_reduce_fallback
6ee558d1
removed unused parameter offset bookkeeping
14e22a25
fixed tracking for multiple param groups
16281df2
Merge branch 'master' into s3-pr
38af6b18
unbroke bfloat16 config after merge conflict
cc7011ec
using base allgather params when only 1 param
806b0726
cleanup/fixes for fp16 partition defragmentation
bf0dd663
manuelciosici
commented on 2021-11-04
Merge branch 'master' into s3-pr
73207aee
tjruwase
commented on 2021-11-05
Merge branch 'master' into s3-pr
d3ecb1fc
Merge branch 'master' into s3-pr
812fe679
switch to CRLF
6dc21a60
convert to same new-line style as master
2a383020
align new line with master
16f1d21d
bazpasha
commented on 2021-11-19
bazpasha
commented on 2021-11-19
Merge branch 'master' into s3-pr
11d590a0
Fix merge issues
2b5f6ea2
tjruwase
commented on 2021-11-23
tjruwase
commented on 2021-11-23
tjruwase
commented on 2021-11-23
tjruwase
commented on 2021-11-23
tjruwase
commented on 2021-11-23
tjruwase
commented on 2021-11-24
tjruwase
commented on 2021-11-24
tjruwase
commented on 2021-11-24
Merge branch 'master' into s3-pr
80b53d31
Merge branch 'master' into s3-pr
6dfe6938
switch to CRLF
912e6f04
tjruwase
commented on 2021-11-29
fix to LF line endings
4b0133b6
minor merge fixes
b998206e
remove extra bfloat16_enabled definition
d6deecb3
asserting params inflight for AllGatherHandle
2a4ef29a
remove get_cuda_mem_allocated_str
90182b66
Merge branch 'master' into s3-pr
ad847edb
Format fixes
f590ba45
tjruwase
commented on 2021-12-08
fix bfloat16 zero stage check (broken after merge commit)
9db815fe
+self.communication_data_type, -self.allreduce_always_fp32; delete de…
259ec153
Add self.reduce_scatter
96d22471
Merge branch 'master' into s3-pr
2630b756
Merge branch 'master' into s3-pr
79fd42cd
Merge branch 'master' into s3-pr
8565e043
Merge branch 'master' into s3-pr
06eab1ac
Format fix
0f8affe3
Merge branch 'master' into s3-pr
3436422e
Fix merge issues
601d1f19
Merge branch 's3-pr' of github.com:jfc4050/DeepSpeed into s3-pr
5dcee36b
Merge branch 'master' into s3-pr
580d25ef
Merge branch 'master' into s3-pr
872f4513
Merge branch 'master' into s3-pr
e236293b
tjruwase
commented on 2022-01-11
tjruwase
commented on 2022-01-11
tjruwase
commented on 2022-01-11
Merge branch 'master' into s3-pr
43b3b83b
tjruwase
commented on 2022-01-11
tjruwase
commented on 2022-01-11
tjruwase
commented on 2022-01-11
tjruwase
commented on 2022-01-11
tjruwase
commented on 2022-01-11
tjruwase
commented on 2022-01-11
tjruwase
requested changes on 2022-01-11
Merge branch 'master' into s3-pr
83905ac1
iterate over params_to_fetch rather than make another iterator
31aecfca
add some TODOs
8736700e
Merge branch 'master' into s3-pr
516379de
remove unnecessary division by micro_step_id
0bf7bcde
rename config keys "bfloat16" -> "bf16"
43c00ff7
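As a rough illustration of the rename above, the sketch below builds a config dict that still uses the old `"bfloat16"` key and maps it onto the new `"bf16"` key. The helper `normalize_bf16_key` is hypothetical and is not how DeepSpeed handles the legacy key internally; the other config values are placeholders.

```python
def normalize_bf16_key(config):
    """Return a copy of config with a legacy "bfloat16" section moved to "bf16".

    Hypothetical helper for illustration only. If both keys are present,
    the newer "bf16" section wins and the legacy one is dropped.
    """
    cfg = dict(config)
    legacy = cfg.pop("bfloat16", None)
    if legacy is not None and "bf16" not in cfg:
        cfg["bf16"] = legacy
    return cfg


# Config written against the old key name (placeholder values).
legacy_config = {
    "train_batch_size": 8,
    "bfloat16": {"enabled": True},
    "zero_optimization": {"stage": 3},
}

current_config = normalize_bf16_key(legacy_config)
print(current_config)
```

The helper copies the dict rather than editing it in place, so a caller can keep the original config for logging or comparison.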
rename stage3_gather_fp16_weights_on_model_save -> stage3_gather_16bi…
4574bc71
jfc4050 force pushed from 55924bc6 to 4574bc71 3 years ago
stas00
commented on 2022-01-19
add unit test to check backwards compatibility for gather_16bit_weights
e04dc6a2
added test to confirm bf16 key bwd compatibility
391cecf7
Merge branch 'master' into s3-pr
3d264694
Format fixes
536d1718
tjruwase
approved these changes on 2022-01-19
Merge branch 'master' into s3-pr
19f35382
tjruwase merged 4912e0ad into master 3 years ago
bliu3650
commented on 2023-05-12
Reviewers
tjruwase
stas00
bliu3650
szhengac
bazpasha
manuelciosici
raamjad
zarzen
awan-10
cli99
conglongli
eltonzheng
jeffra
minjiaz
niumanar
RezaYazdaniAminabadi
samyam
ShadenSmith