llama.cpp PR #15132 (Merged)
CUDA: Optimize `reduce_rows_f32` kernel: up to 25x kernel-level perf improvement and 10% perf increase for Gemma3n
Opened by ORippler
github-actions added labels: testing, Nvidia GPU, ggml
Commits by ORippler:
- 3deb3b16 Factor out `reduce_rows_f32` from common.cuh
- c270ffe1 Hide memory latency by loop unrolling in `reduce_rows_f32`
- ece608a1 Further optimizations to `reduce_rows_f32`
- 9070af87 Add perf tests for `reduce_rows_f32` kernel
- 80de6722 Add heuristic to toggle 128/512 threads based on SM count
- 8e04242c Ensure perf gains also for small ncols and large nrows
- 8fc2c03d Modify perf and unit tests
- 9296d1f8 Apply auto-formatting by clang
ORippler force-pushed from c6ed8cc9 to 9296d1f8 64 days ago
- a6fe4dd5 Fix CI build failure
JohannesGaessler commented on 2025-08-07
- 4a1c5bc8 Remove sm_count property from `ggml_backend_cuda_context`
- 7c7413ec Add CUB-based implementation for GGML_OP_MEAN
- 48cf9e43 Add heuristics to execute CUB branch only when it brings perf
- e8373bf6 Add unit test for CUB-based mean
ORippler requested a review from JohannesGaessler 60 days ago
JohannesGaessler approved these changes on 2025-08-11
- 0e9a5d86 Rename `USE_CUB` to `GGML_CUDA_USE_CUB`
- d647028a Unindent preprocessor directives
JohannesGaessler merged 6028bf74 into master 58 days ago
ORippler deleted the osimons/optimize_reduce_rows_f32 branch 58 days ago