vllm
[Attention][Perf][Kernel] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2
#34917
Merged


LopezCastroRoberto
LopezCastroRoberto add custom concat kernel
e9831552
LopezCastroRoberto requested a review from pavanimajety 129 days ago
LopezCastroRoberto changed the title from "add custom concat kernel" to "[Perf] Replace torch.cat with vectorized CUDA kernel for MLA query concat" 129 days ago
LopezCastroRoberto marked this pull request as draft 129 days ago
mergify added nvidia
mergify added v1
LopezCastroRoberto changed the title from "[Perf] Replace torch.cat with vectorized CUDA kernel for MLA query concat" to "[Perf][WIP] Replace torch.cat with vectorized CUDA kernel for MLA query concat" 129 days ago
gemini-code-assist commented on 2026-02-19
LopezCastroRoberto add tests
0ce15f9d
LopezCastroRoberto add helper vec instructions
09150f91
LopezCastroRoberto changed the title from "[Perf][WIP] Replace torch.cat with vectorized CUDA kernel for MLA query concat" to "[Perf][WIP] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2" 124 days ago
LopezCastroRoberto marked this pull request as ready for review 124 days ago
LopezCastroRoberto requested a review from mgoin 124 days ago
LopezCastroRoberto requested a review from tlrmchlsmth 124 days ago
LopezCastroRoberto requested a review from WoosukKwon 124 days ago
LopezCastroRoberto requested a review from yewentao256 124 days ago
mergify
mergify added deepseek
LopezCastroRoberto adding benchmark script
5a8fe250
LopezCastroRoberto adding benchmark script
9b5c159b
mergify added performance
LopezCastroRoberto tune threadblock size
71d00a2c
LopezCastroRoberto changed the title from "[Perf][WIP] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2" to "[Perf] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2" 124 days ago
LopezCastroRoberto changed the title from "[Perf] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2" to "[Attention][Perf][Kernel] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2" 124 days ago
mgoin added ready
LopezCastroRoberto Merge branch 'main' into concat_mla_q
8d05e603
mergify
mergify added needs-rebase
LucasWilkinson approved these changes on 2026-03-03
LopezCastroRoberto overlap with MoE workspace
8c0e7908
LopezCastroRoberto Merge branch 'main' into concat_mla_q
0e922863
LopezCastroRoberto requested a review from LucasWilkinson 117 days ago
mergify removed needs-rebase
LopezCastroRoberto add missing instructions to vec_utils
36baa4c3
LopezCastroRoberto remove return amd path
75a5e371
LopezCastroRoberto Merge branch 'main' into concat_mla_q
605d23dc
LopezCastroRoberto Merge branch 'main' into concat_mla_q
8a4124c8
LopezCastroRoberto Merge branch 'main' into concat_mla_q
0dba242f
LucasWilkinson enabled auto-merge (squash) 114 days ago
mergify
mergify added needs-rebase
LopezCastroRoberto Merge branch 'main' into concat_mla_q
6c4a0420
Auto-merge was disabled 112 days ago: the head branch was pushed to by a user without write access
mergify removed needs-rebase
LopezCastroRoberto add AMD tests to CI
6f8bb714
mergify added ci/build
LopezCastroRoberto force-pushed to 6f8bb714 112 days ago
LopezCastroRoberto remove amd ci
43dffba1
mgoin approved these changes on 2026-03-09
vllm-bot merged 580864d8 into main 111 days ago

