DeepSpeed
Enabled high-performance Automatic Tensor Parallelism (auto TP) for the MoE models on multiple GPUs/HPUs #6964
Open

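For orientation, below is a minimal sketch of the auto TP inference path this PR targets, assuming a Hugging Face MoE checkpoint and a standard `deepspeed --num_gpus <N>` launch; the model name, dtype, and generation settings are placeholders for illustration and are not taken from this PR.

```python
# Hedged sketch: sharding an MoE model across ranks with DeepSpeed auto TP.
# Assumptions: transformers + deepspeed installed; model name is a placeholder.
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen1.5-MoE-A2.7B"  # placeholder MoE checkpoint
world_size = int(os.getenv("WORLD_SIZE", "1"))

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# With kernel injection disabled, DeepSpeed falls back to automatic tensor
# parallelism and shards supported linear/expert layers across tp_size ranks.
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": world_size},
    dtype=torch.bfloat16,
    replace_with_kernel_inject=False,
)

inputs = tokenizer("DeepSpeed auto TP test:", return_tensors="pt").to(engine.module.device)
outputs = engine.module.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Launched as, for example, `deepspeed --num_gpus 4 run.py`.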
gyou2021 requested a review from hwchen2017 247 days ago
gyou2021 requested a review from loadams 247 days ago
delock commented on 2025-01-21
gyou2021 Reduced the experts allreduce number per layer to ONCE for the Qwen2-…
c9b12af9
gyou2021 Fixed format
590ea36a
gyou2021 Removed print
889c2750
gyou2021 Fix a bug about set.
2ec6c347
inkcherry Add the missing view operations from sequence parallel(async). (#6750)
504d696f
loadams Update `torch.norm` to `torch.linalg.norm` and `torch.linalg.vector_n…
c266dc98
xylian86 Using explicit GPU upcast for ZeRO-Offload (#6962)
ae129935
loadams Update version.txt after 0.16.3 release (#6965)
deb09a3b
tjruwase Precisely track nvme optimizer offload (#6963)
128d436e
loadams Update build_win.bat script to exclude GDS op as it lacks Windows supp…
864472b3
loadams Add CUDA 12.8 support and comment on CUDA 12.7 (#6975)
1ac398c1
loadams Update torch versions to support 2.6 (#6977)
eda53d8b
oelayan7 generalize deepspeed linear and implement it for non cuda systems (#6…
112a7c6a
loadams Update recommended Windows whl building versions (#6983)
7d2c5fec
fabiosanger Title: Fix setup_env_ranks to Properly Set Environment Variables Inst…
f1d326c2
loadams Specify torchvision in nv-ds-chat workflow (prevents errors with torc…
46545d77
xylian86 Remove assumption that padding only occurs on last rank (#6974)
af1ba94e
tjruwase Use ds-specific module id to avoid conflicts (#6847)
e235921f
loadams Update A6000 workflows to use newer docker container - 24.09 vs 24.03…
f5e97963
fabiendupont Allow NVIDIA Blackwell (#6991)
07634b96
tjruwase Update GH org references (#6998)
0e57fa02
loadams Update CNAME
e86c0c30
loadams Update CNAME
0d7f0eb0
Liangliang-Ma [XPU] max1100 workflow update for docker and software (#7003)
cd8a9887
inkcherry autotp training(fix dco) (#7004)
18c712fc
oelayan7 import triton files when triton is supported and installed (#6989)
c5bf6f64
loadams Update A6000 tests transformers version (#7016)
590de5fe
tjruwase Fix ds-chat CI regression (#7015)
693c39ff
stas00 [Ulysses tutorial] typos (#7024)
322a05a6
fitzjalen fix hostname -I for macOS #6497 (#6990)
8869d789
loadams Update workflows to cuda 12.4 (#7000)
e4d03af5
rraminen [ROCm] Enable fp_quantizer on ROCm (#7027)
8c6251da
GuanhuaWang add gds chinese blog (#7034)
e3e179ca
hwchen2017 Add chinese blog for deepspeed windows, and fix format (#7035)
fd2787b3
jomayeri AIO on ROCM (#7023)
ba8ef574
tjruwase Control trace cache warnings (#7039)
f4b0f586
hwchen2017 Update CUDA compute capability to support Blackwell (#7047)
3ca3e2fb
loadams Update setup.py handling of ROCm cupy (#7051)
56127786
loadams nv-ds-chat breaks with latest transformers (#7052)
af8c1900
tjruwase Rename aio_thread_count to intra_op_parallelism (#7056)
225471ad
inkcherry add autoTP training zero2 tests (#7049)
1df293a6
wukong1992 Fix, bf16 optimizer remove dup loop (#7054)
94abf682
loadams Update version.txt after 0.16.4 release (#7063)
4a4ff9ba
stas00 fix an outdated doc wrt CUDA_VISIBLE_DEVICES (#7058)
e5eda47f
siqi654321 Tecorigin sdaa accelerator (#6903)
675ec9af
loadams Handle special case of libuv for Windows (#7064)
81c1fee8
loadams Update README with info on newest accelerator (#7065)
17f544cb
U-rara Bug Fix for offload_states API (#7050)
20fd872c
loadams Fix TOCTOU issues, switch to fstat (#7067)
0b289a26
ShellyNR config torch to avoid graph breaks caused by logger (#6999)
4a86d02e
Yejing-Lai Fix meta load tensor incompatible issue (#7073)
594b5bb1
loadams Replace calls to `python setup.py sdist` with `python -m build --sdis…
a843e399
loadams Revert "Handle special case of libuv for Windows (#7064)" (#7076)
4cbc52c0
Yejing-Lai Add DeepseekV3 AutoTP. (#7045)
586e4366
loadams Improve inference tutorial docs (#7083)
5e379ada
gyou2021 Added support for the environment variable DS_MOE_EXPERTS_REDUCE_ONCE…
13bf8662
gyou2021 Changed env variable name to 'DS_MOE_TP_SINGLE_ALLREDUCE'
d5115bed
loadams Pin transformers version on tests that use latest. (#7085)
f0044cbc
siddharth9820 Update README.md with ICS '23 MoE paper link (#7087)
16ad5fd7
loadams Update parallelism for nv-torch-latest/nightly tests due to more GPUs…
47d4420f
loadams Remove workflows for very old torch versions (#7090)
b3c64dd3
gyou2021 force pushed from f3c6b431 to b3c64dd3 209 days ago
gyou2021 requested a review from tjruwase 209 days ago
gyou2021 requested a review from tohtana 209 days ago
gyou2021 requested a review from jomayeri 209 days ago
gyou2021 requested a review from GuanhuaWang 209 days ago
gyou2021 changed the title from "Enabled high-performance Automatic Tensor Parallelism (auto TP) for the Qwen2-MoE and DeepSeek-V2 models on multiple GPUs/HPUs" to "Enabled high-performance Automatic Tensor Parallelism (auto TP) for the MoE models on multiple GPUs/HPUs" 209 days ago
gyou2021 Fixed conflicts
9b1fe98d
hwchen2017 commented on 2025-03-03
gyou2021 Update auto_tp.py
6b96dd9e
hwchen2017 Merge branch 'master' into autoTP_Qwen2Moe_DeepSeekv2
e7883e7a
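Per the commits above, this PR reduces the per-layer expert allreduce to a single call and gates it behind the `DS_MOE_TP_SINGLE_ALLREDUCE` environment variable (renamed from the earlier variable). A hedged usage sketch follows; the accepted values are an assumption, and the exact check lives in the merged `auto_tp.py`.

```python
# Hedged sketch: opting in to the single per-layer expert allreduce path.
# Assumption: a truthy value enables it; the exact accepted values are defined
# by the check this PR adds in deepspeed/module_inject/auto_tp.py.
import os

os.environ["DS_MOE_TP_SINGLE_ALLREDUCE"] = "1"  # set before model sharding

# ...then build the model and call deepspeed.init_inference() as in the sketch
# near the top of this page, so auto TP sees the flag when it partitions the
# MoE expert layers.
```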
