llama.cpp
[CUDA] Increase number of output elements per-thread block if the K-dimension is small
#20635
Merged


gaugarg-nv
gaugarg-nv requested a review 24 days ago
github-actions added Nvidia GPU
github-actions added ggml
am17an
gaugarg-nv
am17an
am17an commented on 2026-03-16
JohannesGaessler
JohannesGaessler commented on 2026-03-16
gaugarg-nv Increase per-thread work if the K-dimension is small
cfbbfb25
gaugarg-nv force pushed from 4f20a445 to cfbbfb25 22 days ago
gaugarg-nv changed the title from "[CUDA] Use a single warp per element instead of a single block per element if the K-dimension is small" to "[CUDA] Increase number of output elements per-thread block if the K-dimension is small" 22 days ago
gaugarg-nv Limit this change to ncols_dst = 1
6374ae0e
gaugarg-nv tab to space
fd9e3348
JohannesGaessler
JohannesGaessler commented on 2026-03-19
am17an
ggerganov
IMbackK
gaugarg-nv
IMbackK
am17an
am17an
am17an approved these changes on 2026-03-21
JohannesGaessler
JohannesGaessler
am17an
JohannesGaessler
JohannesGaessler approved these changes on 2026-03-22
am17an merged ccb87fa3 into master 18 days ago
CISC
JohannesGaessler
IMbackK

