onnxruntime
[CUDA] GQA CUDA Kernel Fusion and Performance Optimization
#26920

Merged

tianleiwu merged 10 commits into main from tlwu/cuda_gqa_fused_kernel

GQA cuda fused kernel for kv cache and rotary

a412553a

use fused kernel for packed qkv, rotary and first prompt

6a3742d4

tianleiwu marked this pull request as draft 133 days ago

flash attention fast decode

5ee35da2

tianleiwu marked this pull request as ready for review 133 days ago

tianleiwu requested a review from kunal-vaishnavi 133 days ago

tianleiwu requested a review from nenad1002 133 days ago

tianleiwu requested a review from apsonawane 133 days ago

tianleiwu requested a review from copilot-pull-request-reviewer 133 days ago

apsonawane commented on 2026-01-07

update #include

9f6f0734

copilot-pull-request-reviewer commented on 2026-01-07

review feedback

8d20af56

apsonawane dismissed these changes on 2026-01-07

kunal-vaishnavi dismissed these changes on 2026-01-07

Merge branch 'main' into tlwu/cuda_gqa_fused_kernel

eb5b1838

tianleiwu marked this pull request as draft 131 days ago

tianleiwu changed the title from "[CUDA] GQA Fused Kernel for QKV Unpack, RoPE, and KV Cache Append" to "[CUDA] GQA CUDA Kernel Fusion and Performance Optimization" 131 days ago

Improve kernel, document and tests

e768ee05

tianleiwu dismissed their stale review via e768ee05 131 days ago

avoid overflow

699e395d

clean up and assert alignment

69087ca8

optimize buffer size

684c7cb5

tianleiwu marked this pull request as ready for review 131 days ago

tianleiwu requested a review from kunal-vaishnavi 131 days ago

tianleiwu requested a review from apsonawane 131 days ago

tianleiwu requested a review from copilot-pull-request-reviewer 131 days ago

copilot-pull-request-reviewer commented on 2026-01-09

kunal-vaishnavi commented on 2026-01-09

kunal-vaishnavi approved these changes on 2026-01-09

tianleiwu merged 39d8520b into main 130 days ago

tianleiwu deleted the tlwu/cuda_gqa_fused_kernel branch 130 days ago

Reviewers

kunal-vaishnavi

apsonawane

copilot-pull-request-reviewer

nenad1002

Assignees

No one assigned

Labels

None yet

Milestone

No milestone
