DeepSpeed
Enabled high-performance Automatic Tensor Parallelism (auto TP) for the MoE models on multiple GPUs/HPUs #6964
Open

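For orientation, below is a minimal sketch of the auto TP inference path this PR targets, assuming a Hugging Face MoE checkpoint and a standard `deepspeed --num_gpus <N>` launch; the model name, dtype, and generation settings are placeholders for illustration and are not taken from this PR.

```python
# Hedged sketch: sharding an MoE model across ranks with DeepSpeed auto TP.
# Assumptions: transformers + deepspeed installed; model name is a placeholder.
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen1.5-MoE-A2.7B"  # placeholder MoE checkpoint
world_size = int(os.getenv("WORLD_SIZE", "1"))

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# With kernel injection disabled, DeepSpeed falls back to automatic tensor
# parallelism and shards supported linear/expert layers across tp_size ranks.
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": world_size},
    dtype=torch.bfloat16,
    replace_with_kernel_inject=False,
)

inputs = tokenizer("DeepSpeed auto TP test:", return_tensors="pt").to(engine.module.device)
outputs = engine.module.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Launched as, for example, `deepspeed --num_gpus 4 run.py`.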
gyou2021 requested a review from hwchen2017 247 days ago
gyou2021 requested a review from loadams 247 days ago
delock commented on 2025-01-21
gyou2021 Reduced the experts allreduce number per layer to ONCE for the Qwen2-…
c9b12af9
gyou2021 Fixed format
590ea36a
gyou2021 Removed print
889c2750
gyou2021 Fix a bug about set.
2ec6c347
inkcherry Add the missing view operations from sequence parallel(async). (#6750)
504d696f
loadams Update `torch.norm` to `torch.linalg.norm` and `torch.linalg.vector_n…
c266dc98
xylian86 Using explicit GPU upcast for ZeRO-Offload (#6962)
ae129935
loadams Update version.txt after 0.16.3 release (#6965)
deb09a3b
tjruwase Precisely track nvme optimizer offload (#6963)
128d436e
loadams Update build_win.bat script to exclude GDS op as it lacks Windows supp…
864472b3
loadams Add CUDA 12.8 support and comment on CUDA 12.7 (#6975)
1ac398c1
loadams Update torch versions to support 2.6 (#6977)
eda53d8b
oelayan7 generalize deepspeed linear and implement it for non cuda systems (#6…
112a7c6a
loadams Update recommended Windows whl building versions (#6983)
7d2c5fec
fabiosanger Title: Fix setup_env_ranks to Properly Set Environment Variables Inst…
f1d326c2
loadams Specify torchvision in nv-ds-chat workflow (prevents errors with torc…
46545d77
xylian86 Remove assumption that padding only occurs on last rank (#6974)
af1ba94e
tjruwase Use ds-specific module id to avoid conflicts (#6847)
e235921f
loadams Update A6000 workflows to use newer docker container - 24.09 vs 24.03…
f5e97963
fabiendupont Allow NVIDIA Blackwell (#6991)
07634b96
tjruwase Update GH org references (#6998)
0e57fa02
loadams Update CNAME
e86c0c30
loadams Update CNAME
0d7f0eb0
Liangliang-Ma [XPU] max1100 workflow update for docker and software (#7003)
cd8a9887
inkcherry autotp training(fix dco) (#7004)
18c712fc
oelayan7 import triton files when triton is supported and installed (#6989)
c5bf6f64
loadams Update A6000 tests transformers version (#7016)
590de5fe
tjruwase Fix ds-chat CI regression (#7015)
693c39ff
stas00 [Ulysses tutorial] typos (#7024)
322a05a6
fitzjalen fix hostname -I for macOS #6497 (#6990)
8869d789
loadams Update workflows to cuda 12.4 (#7000)
e4d03af5
rraminen [ROCm] Enable fp_quantizer on ROCm (#7027)
8c6251da
GuanhuaWang add gds chinese blog (#7034)
e3e179ca
hwchen2017 Add chinese blog for deepspeed windows, and fix format (#7035)
fd2787b3
jomayeri AIO on ROCM (#7023)
ba8ef574
tjruwase Control trace cache warnings (#7039)
f4b0f586
hwchen2017 Update CUDA compute capability to support Blackwell (#7047)
3ca3e2fb
loadams Update setup.py handling of ROCm cupy (#7051)
56127786
loadams nv-ds-chat breaks with latest transformers (#7052)
af8c1900
tjruwase Rename aio_thread_count to intra_op_parallelism (#7056)
225471ad
inkcherry add autoTP training zero2 tests (#7049)
1df293a6
wukong1992 Fix, bf16 optimizer remove dup loop (#7054)
94abf682
loadams Update version.txt after 0.16.4 release (#7063)
4a4ff9ba
stas00 fix an outdated doc wrt CUDA_VISIBLE_DEVICES (#7058)
e5eda47f
siqi654321 Tecorigin sdaa accelerator (#6903)
675ec9af
loadams Handle special case of libuv for Windows (#7064)
81c1fee8
loadams Update README with info on newest accelerator (#7065)
17f544cb
U-rara Bug Fix for offload_states API (#7050)
20fd872c
loadams Fix TOCTOU issues, switch to fstat (#7067)
0b289a26
ShellyNR config torch to avoid graph breaks caused by logger (#6999)
4a86d02e
Yejing-Lai Fix meta load tensor incompatible issue (#7073)
594b5bb1
loadams Replace calls to `python setup.py sdist` with `python -m build --sdis…
a843e399
loadams Revert "Handle special case of libuv for Windows (#7064)" (#7076)
4cbc52c0
Yejing-Lai Add DeepseekV3 AutoTP. (#7045)
586e4366
loadams Improve inference tutorial docs (#7083)
5e379ada
gyou2021 Added support for the environment variable DS_MOE_EXPERTS_REDUCE_ONCE…
13bf8662
gyou2021 Changed env variable name to 'DS_MOE_TP_SINGLE_ALLREDUCE'
d5115bed
loadams Pin transformers version on tests that use latest. (#7085)
f0044cbc
siddharth9820 Update README.md with ICS '23 MoE paper link (#7087)
16ad5fd7
loadams Update parallelism for nv-torch-latest/nightly tests due to more GPUs…
47d4420f
loadams Remove workflows for very old torch versions (#7090)
b3c64dd3
gyou2021 force pushed from f3c6b431 to b3c64dd3 209 days ago
gyou2021 requested a review from tjruwase 209 days ago
gyou2021 requested a review from tohtana 209 days ago
gyou2021 requested a review from jomayeri 209 days ago
gyou2021 requested a review from GuanhuaWang 209 days ago
gyou2021 changed the title from "Enabled high-performance Automatic Tensor Parallelism (auto TP) for the Qwen2-MoE and DeepSeek-V2 models on multiple GPUs/HPUs" to "Enabled high-performance Automatic Tensor Parallelism (auto TP) for the MoE models on multiple GPUs/HPUs" 209 days ago
gyou2021 Fixed conflicts
9b1fe98d
hwchen2017 commented on 2025-03-03
gyou2021 Update auto_tp.py
6b96dd9e
hwchen2017 Merge branch 'master' into autoTP_Qwen2Moe_DeepSeekv2
e7883e7a
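Per the commits above, this PR reduces the per-layer expert allreduce to a single call and gates it behind the `DS_MOE_TP_SINGLE_ALLREDUCE` environment variable (renamed from the earlier variable). A hedged usage sketch follows; the accepted values are an assumption, and the exact check lives in the merged `auto_tp.py`.

```python
# Hedged sketch: opting in to the single per-layer expert allreduce path.
# Assumption: a truthy value enables it; the exact accepted values are defined
# by the check this PR adds in deepspeed/module_inject/auto_tp.py.
import os

os.environ["DS_MOE_TP_SINGLE_ALLREDUCE"] = "1"  # set before model sharding

# ...then build the model and call deepspeed.init_inference() as in the sketch
# near the top of this page, so auto TP sees the flag when it partitions the
# MoE expert layers.
```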
