DeepSpeed
Support DeepSpeed offload and reload states with ZeRO1 and ZeRO2
#7421
Merged

loadams merged 76 commits into deepspeedai:master from LYMDLUT:master
LYMDLUT requested a review from tjruwase 314 days ago
LYMDLUT requested a review from tohtana 314 days ago
LYMDLUT closed this 313 days ago
LYMDLUT reopened this 312 days ago
LYMDLUT requested a review from loadams 312 days ago
LYMDLUT changed the title from "Try to support deepspeed offload states with ZeRO2" to "Try to support deepspeed offload states with ZeRO1 and ZeRO2" 304 days ago
LYMDLUT changed the title from "Try to support deepspeed offload states with ZeRO1 and ZeRO2" to "Support deepspeed offload and reload states with ZeRO1 and ZeRO2" 272 days ago
LYMDLUT changed the title from "Support deepspeed offload and reload states with ZeRO1 and ZeRO2" to "Support DeepSpeed offload and reload states with ZeRO1 and ZeRO2." 272 days ago
LYMDLUT changed the title from "Support DeepSpeed offload and reload states with ZeRO1 and ZeRO2." to "Support DeepSpeed offload and reload states with ZeRO1 and ZeRO2" 272 days ago
LYMDLUT Update stage_1_and_2.py
c723c328
LYMDLUT Update stage_1_and_2.py
b808560d
LYMDLUT Update engine.py
2e2e0583
LYMDLUT Update offload_states.py
1740fe95
deepcharm Align missing argument in AllReduceCoalescedHandle (#7414)
6f6bba44
alexk101 Improvements to Communication Logger (#7404)
db550fc4
LYMDLUT Add files via upload
1353e684
LYMDLUT Update test_offload_states_zero2.py
f0f1ed7f
LYMDLUT Update test_offload_states_zero2.py
c94fe1c6
LYMDLUT Update stage_1_and_2.py
b1ae7f6e
LYMDLUT Update test_offload_states_zero2.py
17c7f979
LYMDLUT Update stage_1_and_2.py
5c4cfcf7
stas00 trying to fix nv-accelerate-v100.yml CI job (#7424)
cfb37836
LYMDLUT Update stage_1_and_2.py
e9970a42
LYMDLUT Update stage_1_and_2.py
b0ee1b43
saforem2 fix: Propagate `strip_tensor_paddings` (#7426)
d2c19ed6
deepcharm Use past_key_value when provided (#7428)
b7f98fe0
stas00 set `device_id` in torch's `init_process_group` (#7266)
2b9ac51d
stas00 [Ulysses-ALST] add FA3 support (#7430)
bba4756c
stas00 TiledMLP + SequenceTiledCompute: improve the bs>1 use-case (#7422)
7bcca9ac
LYMDLUT Update test_offload_states_zero2.py
267281aa
LYMDLUT Update test_offload_states_zero2.py
bb6769e1
LYMDLUT Update test_offload_states_zero2.py
0348fe9d
LYMDLUT Update stage_1_and_2.py
f69409e4
LYMDLUT Update stage_1_and_2.py
61c84f19
LYMDLUT Update stage_1_and_2.py
fcf950f8
LYMDLUT Update stage_1_and_2.py
5215a444
LYMDLUT Update stage_1_and_2.py
7137e7b9
LYMDLUT Update stage_1_and_2.py
fb2c3699
LYMDLUT Update test_offload_states_zero2.py
277e6261
LYMDLUT Update test_offload_states.py
268a7096
LYMDLUT Update test_offload_states.py
23ed5b8b
LYMDLUT Update stage_1_and_2.py
4e5f24f5
LYMDLUT Update stage_1_and_2.py
8053b8e0
LYMDLUT Update stage_1_and_2.py
1d6327d0
LYMDLUT Update stage_1_and_2.py
5efd58e0
LYMDLUT Update test_offload_states_zero2.py
b24de28b
loadams Remove tests from README that are already removed. (#7441)
abcf2186
stas00 [ALST] fix typo in the url (#7444)
dbc4b7dd
stas00 [ALST] fix typo in the url part2 (#7446)
c605f546
loadams Remove additional unused tests (human-eval) (#7445)
85d5efd1
huanyuqu Fix: Adapt Llama injection policy for newer transformers versions (#7…
3d747ef5
loadams Update version.txt after 0.17.3 release. (#7455)
d3a477e9
weeknan Fix: UnboundLocalError for variable 'dim' about issue (#7449)
f13d098c
stas00 adding TiledFusedLogitsLoss (#7437)
26551631
stas00 `TiledFusedLogitsLoss` bug fix (#7459)
fc9efa0f
loadams Update version.txt after v0.17.4 release
947bdd72
loadams Revert "Update version.txt after v0.17.4 release"
3a11e34e
loadams Update version.txt after v0.17.4 release (#7460)
0e5e1604
PKUWZP Update README.md (#7465)
243f48eb
WoosungMyung Add getter APIs for TP/PP/DP ranks in DeepSpeedEngine (#7427)
27b24f06
NirSonnenschein fix issues raised by Coverity scans (#7431)
4f9a9a04
eternalNight Fix all-gather duplicate params and wrong dtype (#7462)
2255f5fd
lpnpcs fix #7188 (#7371)
984386ce
delock add --bind_cores_to_rank to zero offload tutorial (#7474)
8516f9fc
Antlera Add blog for ZenFlow (#7463)
376c5b7a
sfc-gh-truwase Fix cpu CI (#7481)
36b925ab
stas00 fix `deepspeed --venv_script` (#7469)
732ed3c4
sfc-gh-truwase Modal CI (#7289)
61681ce2
stas00 [UlyssesSPDataLoaderAdapter] fix iterator reset (#7472)
c756078b
stas00 [TiledFusedLogitsLoss] support inference (#7477)
cc5261d0
LYMDLUT Update test_offload_states_zero2.py
fde1035f
LYMDLUT Update test_offload_states_zero2.py
d16aa8a8
LYMDLUT Update stage_1_and_2.py
f5f7d494
AlongWY Fix pre-compile on cpu-only machines (#7168)
8a0d2262
sfc-gh-truwase Enable forked PRs (#7486)
4ffb4426
yao-matrix fix xpu device_id AttributeError issue (#7488)
d0db8f80
Antlera Add Zenflow code for Stage 1 & 2 (#7391)
672f326a
cyyever Fix invalid f-strings (#7457)
91bb16bb
tohtana Fix DeepCompile for PyTorch v2.8 (#7496)
5777e6cb
deepcharm Reduce performance impact of compiler.enable decorator (#7498)
0a6ff078
deepcharm Add index to HPU devices (#7497)
2769d2a3
LYMDLUT Delete tests/unit/runtime/zero/test_offload_states_zero2.py
6ef88b9b
LYMDLUT force pushed from b9385872 to 6ef88b9b 271 days ago
LYMDLUT requested a review from jomayeri 271 days ago
LYMDLUT requested a review from hwchen2017 271 days ago
LYMDLUT requested a review from GuanhuaWang 271 days ago
sfc-gh-truwase Merge branch 'master' into master
f2dcffb0
sfc-gh-truwase approved these changes on 2025-08-20
sfc-gh-truwase Format fixes
fbefa4ab
loadams Merge branch 'master' into master
829b742b
loadams enabled auto-merge (squash) 270 days ago
loadams merged bc8c0db3 into master 270 days ago