DeepSpeed
Improve overflow handling in ZeRO
#6976
Merged

loadams merged 99 commits into master from olruwase/ds_5241
tjruwase
tjruwase Improve overflow handling in ZeRO
a3a18f72
tjruwase requested a review from tohtana 333 days ago
tjruwase requested a review from loadams 333 days ago
tjruwase Unit test and pydantic configuration
19431f80
tjruwase Formatting fixes
406cf26f
tjruwase Merge branch 'master' into olruwase/ds_5241
35570f54
tjruwase Remove unused symbol
cb784448
tjruwase Fix typo
ee1c1fd0
tjruwase Pydantic fp16 config
0b2cf73a
tjruwase Fix more typos
c7a90f9f
tjruwase Address #4986
3694e07d
tjruwase Merge branch 'master' into olruwase/ds_5241
2bbcf00f
tjruwase Merge branch 'master' into olruwase/ds_5241
c1b87ead
tjruwase
tjruwase Merge branch 'master' into olruwase/ds_5241
5da6cd0f
loadams Merge branch 'master' into olruwase/ds_5241
a65d20c9
loadams commented on 2025-01-30
tjruwase Fix typo
ae039b29
tjruwase Merge branch 'olruwase/ds_5241' of github.com:microsoft/DeepSpeed int…
04461922
tjruwase Merge branch 'master' into olruwase/ds_5241
5d48745d
loadams Merge branch 'master' into olruwase/ds_5241
05c362d9
loadams Merge branch 'master' into olruwase/ds_5241
5e17ed67
delock
tjruwase Merge branch 'master' into olruwase/ds_5241
06bb3a61
tjruwase Fix min loss scale
0d0ab3d4
tjruwase Merge branch 'master' into olruwase/ds_5241
cccd5b11
tjruwase Fix UTs
2c6f6307
tjruwase Merge branch 'olruwase/ds_5241' of github.com:microsoft/DeepSpeed int…
21bfca08
tjruwase Merge branch 'master' into olruwase/ds_5241
5fe58101
xylian86 Using explicit GPU upcast for ZeRO-Offload (#6962)
732ceb7c
loadams Update version.txt after 0.16.3 release (#6965)
db9aff9f
tjruwase Precisely track nvme optimizer offload (#6963)
4edeb033
loadams Update build_win.bat script to exclue GDS op as it lacks Windows supp…
f00f4ea5
tjruwase Improve overflow handling in ZeRO
c3846faa
tjruwase Unit test and pydantic configuration
7d56ffa9
tjruwase Formatting fixes
6ca11efa
loadams Add CUDA 12.8 support and comment on CUDA 12.7 (#6975)
49f3df86
loadams Update torch versions to support 2.6 (#6977)
8364b125
tjruwase Remove unused symbol
ea9b4732
tjruwase Fix typo
d2425a2a
tjruwase Pydantic fp16 config
7d5be078
tjruwase Fix more typos
e8fc098a
tjruwase Address #4986
2bbb7b4f
oelayan7 generalize deepspeed linear and implement it for non cuda systems (#6…
3ab5e885
tjruwase Fix typo
271db941
loadams Update recommended Windows whl building versions (#6983)
b1900af1
fabiosanger Title: Fix setup_env_ranks to Properly Set Environment Variables Inst…
e3d10e5a
loadams Specify torchvision in nv-ds-chat workflow (prevents errors with torc…
b8d8e390
xylian86 Remove assumption that padding only occurs on last rank (#6974)
fde7df1f
tjruwase Use ds-specific module id to avoid conflicts (#6847)
b0b01321
loadams Update A6000 workflows to use newer docker container - 24.09 vs 24.03…
353ab08b
fabiendupont Allow NVIDIA Blackwell (#6991)
14189a72
tjruwase Update GH org references (#6998)
75996f89
tjruwase Fix min loss scale
b23c545c
tjruwase Fix UTs
7cd3a9f9
loadams Update CNAME
2c5629e0
loadams Update CNAME
6b156883
Liangliang-Ma [XPU] max1100 workflow update for docker and softwares (#7003)
3773d837
inkcherry autotp training(fix dco) (#7004)
64c4b04c
tjruwase Merge branch 'olruwase/ds_5241' of github.com:microsoft/DeepSpeed int…
5fa29105
tjruwase Merge branch 'master' into olruwase/ds_5241
1f5a672a
tjruwase Fix ds-chat CI regression
98821161
tjruwase Merge branch 'olruwase/ds_7014' of github.com:microsoft/DeepSpeed int…
97d79158
tjruwase requested a review from hwchen2017 323 days ago
tjruwase Fix bug
4a1dd0fc
tjruwase Avoid naming collision on partition()
0ac44574
tjruwase Merge branch 'master' into olruwase/ds_5241
1597d48b
tjruwase Use new API
2ae20626
tjruwase Merge branch 'master' into olruwase/ds_7014
9fb73a4d
tjruwase Merge branch 'olruwase/ds_7014' of github.com:microsoft/DeepSpeed int…
26fa8af3
tjruwase Merge branch 'olruwase/ds_5241' of github.com:microsoft/DeepSpeed int…
b565d778
loadams Merge branch 'master' into olruwase/ds_5241
d098c322
tjruwase Merge branch 'master' into olruwase/ds_5241
990a5ad8
tjruwase Merge branch 'master' into olruwase/ds_5241
9b1b030b
tjruwase Merge branch 'master' into olruwase/ds_5241
1953c38f
tjruwase Code cleanup
2ea182ef
tjruwase Merge branch 'olruwase/ds_5241' of github.com:microsoft/DeepSpeed int…
9aff2087
tjruwase Merge branch 'master' into olruwase/ds_5241
36c55d24
tjruwase Merge branch 'master' into olruwase/ds_5241
80fcb83b
tjruwase Merge branch 'master' into olruwase/ds_5241
776385fc
tjruwase Merge branch 'master' into olruwase/ds_5241
e5f64af1
tjruwase Use new dlpack api; Formatting fixes
61685dc0
tjruwase Merge branch 'olruwase/new_dlpack_api' of github.com:microsoft/DeepSp…
75ac86cf
tjruwase requested a review from jomayeri 299 days ago
tjruwase Merge branch 'master' into olruwase/ds_5241
6b9736c5
tjruwase Triage pytest --forked cupy failure
83850adb
tjruwase Merge branch 'olruwase/ds_5241' of github.com:microsoft/DeepSpeed int…
4d56c995
tjruwase Revert pytest debugging
5e76c7dd
loadams Merge branch 'master' into olruwase/ds_5241
a59cb55f
loadams Merge branch 'master' into olruwase/ds_5241
f10a2f21
tjruwase Merge branch 'master' into olruwase/ds_5241
919f5385
tjruwase Merge branch 'master' of github.com:microsoft/DeepSpeed into olruwase…
4b583262
tjruwase Merge branch 'olruwase/ds_5241' of github.com:microsoft/DeepSpeed int…
75203d72
tjruwase UT workaround
08a07cbc
tjruwase Merge branch 'master' into olruwase/ds_5241
728dd387
tjruwase Merge branch 'master' into olruwase/ds_5241
2ac92112
tjruwase Merge branch 'master' into olruwase/ds_5241
2d6913a1
tjruwase Merge branch 'master' into olruwase/ds_5241
55395db3
loadams Merge branch 'master' into olruwase/ds_5241
1f38d597
sayakpaul
sfc-gh-truwase Merge branch 'master' into olruwase/ds_5241
58e61a04
tjruwase Merge branch 'master' into olruwase/ds_5241
fa30042b
tjruwase Merge branch 'master' into olruwase/ds_5241
e76bfd83
tjruwase
loadams Merge branch 'master' into olruwase/ds_5241
9b4289a1
tjruwase Merge branch 'master' into olruwase/ds_5241
16bcd901
tjruwase Merge branch 'master' into olruwase/ds_5241
6373a578
loadams approved these changes on 2025-06-09
loadams Merge branch 'master' into olruwase/ds_5241
99f356dd
loadams enabled auto-merge (squash) 201 days ago
loadams merged e440506b into master 201 days ago
loadams deleted the olruwase/ds_5241 branch 201 days ago
