[CUDA] GQA CUDA Kernel Fusion and Performance Optimization #26920
GQA cuda fused kernel for kv cache and rotary
a412553a
use fused kernel for packed qkv, rotary and first prompt
6a3742d4
tianleiwu
marked this pull request as draft 133 days ago
flash attention fast decode
5ee35da2
tianleiwu
marked this pull request as ready for review 133 days ago
update #include
9f6f0734
review feedback
8d20af56
Merge branch 'main' into tlwu/cuda_gqa_fused_kernel
eb5b1838
tianleiwu
marked this pull request as draft 131 days ago
tianleiwu
changed the title from "[CUDA] GQA Fused Kernel for QKV Unpack, RoPE, and KV Cache Append" to "[CUDA] GQA CUDA Kernel Fusion and Performance Optimization" 131 days ago
Improve kernel, document and tests
e768ee05
tianleiwu
dismissed their stale review
via e768ee05
131 days ago
tianleiwu
dismissed their stale review
via e768ee05
131 days ago
avoid overflow
699e395d
clean up and assert alignment
69087ca8
optimize buffer size
684c7cb5
tianleiwu
marked this pull request as ready for review 131 days ago
tianleiwu
merged
39d8520b
into main 130 days ago
tianleiwu
deleted the tlwu/cuda_gqa_fused_kernel branch 130 days ago