llama.cpp
[CUDA] Increase number of output elements per-thread block if the K-dimension is small
#20635

Merged


am17an merged 3 commits into ggml-org:master from gaugarg-nv:small_k_optimization

gaugarg-nv requested a review 90 days ago

github-actions added the Nvidia GPU and ggml labels

am17an commented on 2026-03-16

JohannesGaessler commented on 2026-03-16

Commit cfbbfb25: Increase per-thread work if the K-dimension is small

gaugarg-nv force pushed from 4f20a445 to cfbbfb25 88 days ago

gaugarg-nv changed the title from "[CUDA] Use a single warp per element instead of a single block per element if the K-dimension is small" to "[CUDA] Increase number of output elements per-thread block if the K-dimension is small" 88 days ago

Commit 6374ae0e: Limit this change to ncols_dst = 1
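The commits and the title change above describe the optimization in one line each: when the reduction dimension K is short, a block that computes a single output element leaves most of its threads without work, so the block is given several output elements instead (the earlier title suggests roughly one warp per element). The C++ sketch below is a CPU emulation of that idea for a matrix-vector product with a single destination column (ncols_dst = 1). It is illustrative only: `rows_per_block`, `block_matvec`, and `matvec` are hypothetical names, and the packing rule (`block_threads / K` rows per block) is an assumption, not the heuristic used by llama.cpp's CUDA kernels, whose thread counts, thresholds, and reductions may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical heuristic (illustration only, not llama.cpp's actual rule):
// how many output rows one block of `block_threads` threads should compute.
int rows_per_block(int K, int block_threads) {
    if (K <= 0 || K >= block_threads) {
        return 1;                        // a single row already keeps every thread busy
    }
    return block_threads / K;            // small K: pack several rows into one block
}

// CPU emulation of one thread block computing y[row0, row0 + rows) = A * x
// for a single destination column (ncols_dst = 1). A is M x K, row-major.
void block_matvec(const std::vector<float>& A, const std::vector<float>& x,
                  std::vector<float>& y, int M, int K, int row0,
                  int block_threads, int rows) {
    assert(rows >= 1 && rows <= block_threads);
    const int threads_per_row = block_threads / rows;   // e.g. one warp per row
    std::vector<float> partial(block_threads, 0.0f);    // stands in for per-thread registers

    for (int t = 0; t < block_threads; ++t) {           // each t plays one CUDA thread
        const int local_row = t / threads_per_row;
        const int lane = t % threads_per_row;
        const int r = row0 + local_row;
        if (local_row >= rows || r >= M) {
            continue;                                   // leftover or out-of-range thread
        }
        for (int k = lane; k < K; k += threads_per_row) {
            partial[t] += A[static_cast<std::size_t>(r) * K + k] * x[k];
        }
    }

    // Per-row reduction, standing in for a warp shuffle or shared-memory reduction.
    for (int i = 0; i < rows && row0 + i < M; ++i) {
        float sum = 0.0f;
        for (int lane = 0; lane < threads_per_row; ++lane) {
            sum += partial[i * threads_per_row + lane];
        }
        y[row0 + i] = sum;
    }
}

// Launch-like driver: one loop iteration corresponds to one block in the grid.
std::vector<float> matvec(const std::vector<float>& A, const std::vector<float>& x,
                          int M, int K, int block_threads) {
    const int rows = rows_per_block(K, block_threads);
    std::vector<float> y(M, 0.0f);
    for (int row0 = 0; row0 < M; row0 += rows) {
        block_matvec(A, x, y, M, K, row0, block_threads, rows);
    }
    return y;
}
```

With K = 32 and 128 threads per block, one row per block would leave 96 threads with no elements to process; under the assumed rule the sketch instead packs 4 rows into each block, so the grid needs a quarter as many blocks. For K of 128 or more it falls back to one row per block. The sketch deliberately covers only the single-column case, mirroring the ncols_dst = 1 restriction in the commit.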

Commit fd9e3348: tab to space

JohannesGaessler commented on 2026-03-19

am17an approved these changes on 2026-03-21

JohannesGaessler approved these changes on 2026-03-22

am17an merged ccb87fa3 into master 84 days ago

gaugarg-nv deleted the small_k_optimization branch 62 days ago

Reviewers: JohannesGaessler, am17an

Labels: Nvidia GPU, ggml
