vllm
[Attention][Perf][Kernel] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2
#34917
Merged

LopezCastroRoberto
LopezCastroRoberto add custom concat kernel
e9831552
LopezCastroRoberto requested a review from pavanimajety 80 days ago
LopezCastroRoberto changed the title from "add custom concat kernel" to "[Perf] Replace torch.cat with vectorized CUDA kernel for MLA query concat" 80 days ago
LopezCastroRoberto marked this pull request as draft 80 days ago
mergify added nvidia
mergify added v1
LopezCastroRoberto changed the title from "[Perf] Replace torch.cat with vectorized CUDA kernel for MLA query concat" to "[Perf][WIP] Replace torch.cat with vectorized CUDA kernel for MLA query concat" 80 days ago
gemini-code-assist commented on 2026-02-19
LopezCastroRoberto add tests
0ce15f9d
LopezCastroRoberto add helper vec instructions
09150f91
LopezCastroRoberto changed the title from "[Perf][WIP] Replace torch.cat with vectorized CUDA kernel for MLA query concat" to "[Perf][WIP] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2" 75 days ago
LopezCastroRoberto marked this pull request as ready for review 75 days ago
LopezCastroRoberto requested a review from mgoin 75 days ago
LopezCastroRoberto requested a review from tlrmchlsmth 75 days ago
LopezCastroRoberto requested a review from WoosukKwon 75 days ago
LopezCastroRoberto requested a review from yewentao256 75 days ago
mergify
mergify added deepseek
LopezCastroRoberto adding benchmark script
5a8fe250
LopezCastroRoberto adding benchmark script
9b5c159b
mergify added performance
LopezCastroRoberto tune threadblock size
71d00a2c
LopezCastroRoberto changed the title from "[Perf][WIP] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2" to "[Perf] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2" 74 days ago
LopezCastroRoberto changed the title from "[Perf] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2" to "[Attention][Perf][Kernel] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2" 74 days ago
mgoin added ready
LopezCastroRoberto Merge branch 'main' into concat_mla_q
8d05e603
mergify
mergify added needs-rebase
LucasWilkinson approved these changes on 2026-03-03
LopezCastroRoberto overlap with MoE workspace
8c0e7908
LopezCastroRoberto Merge branch 'main' into concat_mla_q
0e922863
LopezCastroRoberto requested a review from LucasWilkinson 67 days ago
mergify removed needs-rebase
LopezCastroRoberto add missing instructions to vec_utils
36baa4c3
LopezCastroRoberto remove return amd path
75a5e371
LopezCastroRoberto Merge branch 'main' into concat_mla_q
605d23dc
LopezCastroRoberto Merge branch 'main' into concat_mla_q
8a4124c8
LopezCastroRoberto Merge branch 'main' into concat_mla_q
0dba242f
LucasWilkinson enabled auto-merge (squash) 65 days ago
mergify
mergify added needs-rebase
LopezCastroRoberto Merge branch 'main' into concat_mla_q
6c4a0420
Auto-merge was disabled 62 days ago: the head branch was pushed to by a user without write access
mergify removed needs-rebase
LopezCastroRoberto add AMD tests to CI
6f8bb714
mergify added ci/build
LopezCastroRoberto force pushed to 6f8bb714 62 days ago
LopezCastroRoberto remove amd ci
43dffba1
mgoin approved these changes on 2026-03-09
vllm-bot merged 580864d8 into main 62 days ago
