[Core] Add Helix (Context + Tensor) Parallelism #34024 (Open)

sungsooha wants to merge 39 commits into vllm-project:main from sungsooha:helix-parallelism.
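For orientation, "Helix" here means running tensor parallelism (sharding attention heads and weights) together with decode context parallelism (sharding the KV cache across ranks at decode time). The snippet below is a minimal, hypothetical launch sketch of how such a configuration might be expressed with vLLM's offline API; the model name is only an example, and the decode_context_parallel_size argument (like the helix_mode knob mentioned in the commits) is an assumption based on this PR's commit messages rather than a confirmed public interface.

```python
# Hypothetical sketch only: pair tensor parallelism (TP) with decode context
# parallelism (DCP), the "Helix" combination this PR targets. The
# decode_context_parallel_size argument name is an assumption taken from the
# commit messages, not a verified vLLM API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example GQA model; not prescribed by the PR
    tensor_parallel_size=4,          # TP: shard attention heads / weight matrices across 4 GPUs
    decode_context_parallel_size=2,  # DCP: shard the KV cache along the sequence dimension
)

outputs = llm.generate(
    ["Summarize Helix parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```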
sungsooha requested reviews from mgoin, pavanimajety, LucasWilkinson, WoosukKwon, youkaichao, robertgshaw2-redhat, tlrmchlsmth, houseroad, hmellor, yewentao256, and ProExpertProg 1 day ago.
mergify added the documentation, llama, nvidia, and v1 labels.
gemini-code-assist commented on 2026-02-06.
Commits (39):
66d4e6c5  [Helix] Add Helix parallelism for decode context parallel
9ce51f3b  [Helix] Add full GQA support and FlashInfer/MLA integration
c1b08fe5  [Helix] Add GQA model support with proper head distribution
13b71d9f  [Helix] Add helix_mode and DCP to engine init log
791ae8e6  [Helix] Add unit and integration tests
2f62b4d3  [Helix] Add functional tests and documentation
c261a6fa  [Helix] Add functional tests and documentation
a82e5192  [Helix] Fix CUDA fork issue in functional tests
f4cd9454  fix(tests): simplify GPU check to match vLLM test patterns
334d512f  fix(tests): use multi_gpu_test decorator for proper GPU detection
3aaaa044  [Bugfix] Restore DCP->PIECEWISE CUDA graph check
b9f9479b  [Helix] Fix FlashInfer GQA mode head count configuration
7891e1af  [Helix] Skip FlashInferMLA when DCP enabled (no LSE support)
0e96ac3f  [Helix] Fix FlashInfer num_qo_heads computation for GQA mode
9e630662  fix: use vllm_config.parallel_config instead of self.parallel_config
cb475429  fix: access total_num_attention_heads via model_arch_config
2cc76460  fix: compute FlashInfer head counts at build() time for Helix GQA
8aa16cdf  fix: add missing .contiguous() in FlashInfer Helix GQA decode path
0977bf0f  fix: pass is_lse_base_on_e=False for FlashInfer Helix paths
002db495  fix: explicitly set CUTLASS_MLA backend when DCP is enabled
1a6639c4  fix: move lse_query transpose AFTER head scatter in Helix GQA prefill
70aede06  revert: match internal repo FlashInfer Helix implementation exactly
993ef12d  fix(flashinfer): use built-in fast_decode_plan instead of custom impl
ff44c9f8  Revert "fix(flashinfer): use built-in fast_decode_plan instead of cus…
e3a77f1c  feat: add validation to prevent FlashInfer + Helix GQA combination
6d0f71e7  refactor: remove Helix GQA code from FlashInfer backend
402953b1  docs: update Helix documentation and tests for backend compatibility
95a961ca  fix: use _qkv_tp_rank in legacy weight_loader for Helix GQA
9a3ab6d8  fix: remove duplicate q_pad_num_heads in CutlassMLAImpl
9fc96b67  fix: guard get_current_vllm_config() during torch.compile tracing
c793510d  fix: allow FULL CUDA graphs for MLA models with DCP
1a9ff9e2  fix: apply Helix All-to-All for MLA decode in forward_impl
a50c1d6c  perf(helix): add buffer reuse to reduce allocation overhead
afbc7a1e  perf(helix): add packed single-A2A optimization
973781e2  revert: remove A2A optimizations (no measurable benefit)
46e1fa56  [Cleanup] Remove dead MLACommonImpl.forward() method
sungsooha force-pushed from de681c09 to 46e1fa56 1 day ago.
bb0fdd31  fix(tests): add CPU reference implementation for LSE combine (a sketch of this combine step follows the commit list)
51a281d4  style: fix pre-commit lint issues
15daff17  fix: pre-commit markdownlint and mypy errors
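The test commits above reference a CPU implementation of the LSE combine that merges per-rank partial attention results. The idea: each context-parallel rank attends only to its local slice of the KV cache and returns a partial output plus the log-sum-exp (LSE) of its local scores; weighting each partial output by exp(lse_rank - lse_global) and summing reproduces the full-context result exactly. The snippet below is an illustrative PyTorch sketch of that math, not the reference implementation added by this PR.

```python
# Illustrative sketch of the LSE-combine step used when attention is sharded
# across context-parallel ranks; not the code added in this PR.
import torch

def combine_lse(outputs: torch.Tensor, lses: torch.Tensor) -> torch.Tensor:
    """Merge per-rank partial attention outputs using their log-sum-exp values.

    outputs: [num_ranks, num_tokens, num_heads, head_dim]
    lses:    [num_ranks, num_tokens, num_heads]
    returns: [num_tokens, num_heads, head_dim]
    """
    global_lse = torch.logsumexp(lses, dim=0)           # [tokens, heads]
    weights = torch.exp(lses - global_lse)              # [ranks, tokens, heads]
    return (weights.unsqueeze(-1) * outputs).sum(dim=0)

# Sanity check: sharded-then-combined attention matches unsharded attention.
torch.manual_seed(0)
ranks, tokens, heads, head_dim, kv_per_rank = 2, 3, 4, 8, 16
q = torch.randn(tokens, heads, head_dim)
k = torch.randn(ranks, kv_per_rank, heads, head_dim)
v = torch.randn(ranks, kv_per_rank, heads, head_dim)
scale = head_dim ** -0.5

partials, lses = [], []
for r in range(ranks):
    scores = torch.einsum("thd,khd->thk", q, k[r]) * scale       # local attention scores
    lses.append(torch.logsumexp(scores, dim=-1))                  # local LSE per (token, head)
    partials.append(torch.einsum("thk,khd->thd", scores.softmax(-1), v[r]))

combined = combine_lse(torch.stack(partials), torch.stack(lses))

# Unsharded reference over the concatenated KV cache.
k_full, v_full = k.flatten(0, 1), v.flatten(0, 1)
ref_scores = torch.einsum("thd,khd->thk", q, k_full) * scale
reference = torch.einsum("thk,khd->thd", ref_scores.softmax(-1), v_full)
torch.testing.assert_close(combined, reference)
```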
Reviewers: gemini-code-assist, mgoin, pavanimajety, LucasWilkinson, WoosukKwon, youkaichao, robertgshaw2-redhat, tlrmchlsmth, houseroad, hmellor, yewentao256, ProExpertProg
Assignees: none
Labels: documentation, v1, llama, nvidia
Milestone: none