llama.cpp
CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case
#12183
Merged


Opened by gaugarg-nv
gaugarg-nv requested a review from JohannesGaessler 247 days ago
github-actions added the Nvidia GPU label
github-actions added the ggml label
gaugarg-nv force-pushed from bb46418c to be39646a 247 days ago
gaugarg-nv force-pushed from be39646a to 76881ac5 247 days ago
gaugarg-nv changed the title from "CUDA: Improve flash decoding kernel occupancy for BS=1 case" to "CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case" 246 days ago
ggerganov commented on 2025-03-06
JohannesGaessler commented on 2025-03-06
gaugarg-nv force-pushed from a8a71758 to d83b0d07 244 days ago
github-actions added the testing label
gaugarg-nv force-pushed from d83b0d07 to b6e067b9 235 days ago
JohannesGaessler: CUDA: determine FA parallel blocks at runtime (2d011e65)
gaugarg-nv: CUDA: Improve flash decoding kernel occupancy for BS=1 case (aa5aa01d)
gaugarg-nv force-pushed from b6e067b9 to aa5aa01d 232 days ago
JohannesGaessler: consider tail effects for parallel_blocks (66d873b6)
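The commits above pick the number of parallel blocks for the flash-attention decode kernel at runtime, accounting for the "tail" effect: if the total block count is not a multiple of the SM count, the last wave of blocks leaves most SMs idle. A minimal sketch of that idea follows; this is a hypothetical helper, not the actual llama.cpp implementation, and the names `choose_parallel_blocks`, `base_blocks`, `num_sms`, and `max_split` are illustrative.

```cpp
// Sketch only (hypothetical, not llama.cpp code): choose how many ways to
// split each attention head's KV sequence so that the resulting grid fills
// the GPU's SMs with as little tail (partially filled final wave) as possible.
//
// base_blocks: blocks launched without any splitting (e.g. heads * batch)
// num_sms:     number of streaming multiprocessors on the device
// max_split:   largest split factor worth considering
static int choose_parallel_blocks(int base_blocks, int num_sms, int max_split) {
    int    best_p   = 1;
    double best_eff = 0.0;
    for (int p = 1; p <= max_split; ++p) {
        const int blocks = base_blocks * p;
        const int waves  = (blocks + num_sms - 1) / num_sms; // ceil division
        // Fraction of launched SM slots doing useful work across all waves:
        const double eff = (double) blocks / (double) (waves * num_sms);
        if (eff > best_eff) { // strict '>' prefers the smallest adequate split
            best_eff = eff;
            best_p   = p;
        }
    }
    return best_p;
}
```

For example, with 108 SMs and 32 base blocks, a split of 3 yields 96 blocks in a single wave (about 89% full), whereas a split of 4 yields 128 blocks and spills into a second, mostly idle wave, so 3 is chosen.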
JohannesGaessler approved these changes on 2025-03-19
IMbackK approved these changes on 2025-03-19
ggerganov approved these changes on 2025-03-19
IMbackK merged 517b5ddb into master 232 days ago
gaugarg-nv deleted the flash_decoding_improvement branch 232 days ago
