CUDA: Optimize `reduce_rows_f32` kernel, giving up to a 25x kernel-level perf improvement and a 10% perf increase for Gemma3n #15132
Factor out `reduce_rows_f32` from common.cuh
3deb3b16
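The commit list does not show the kernel itself; as a rough orientation, a row-wise f32 reduction of this shape typically looks like the sketch below. Names, layout, and the block-level reduction are illustrative, not the exact code moved out of `common.cuh`.

```cuda
// Minimal sketch (not the exact factored-out kernel): one block reduces one
// row of `ncols` f32 values; `norm == true` divides by ncols to yield a mean.
// Assumes blockDim.x is a multiple of 32 and at most 1024.
template <bool norm>
static __global__ void reduce_rows_f32(const float * __restrict__ x,
                                       float * __restrict__ dst, const int ncols) {
    const int row = blockIdx.x;

    float sum = 0.0f;
    for (int col = threadIdx.x; col < ncols; col += blockDim.x) {
        sum += x[row * ncols + col];
    }

    // Warp-level reduction via shuffles.
    #pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1) {
        sum += __shfl_xor_sync(0xffffffff, sum, offset, 32);
    }

    // Block-level reduction across warps through shared memory.
    __shared__ float s_sum[32];
    const int warp_id = threadIdx.x / 32;
    const int lane_id = threadIdx.x % 32;
    if (lane_id == 0) {
        s_sum[warp_id] = sum;
    }
    __syncthreads();

    if (warp_id == 0) {
        sum = lane_id < blockDim.x / 32 ? s_sum[lane_id] : 0.0f;
        #pragma unroll
        for (int offset = 16; offset > 0; offset >>= 1) {
            sum += __shfl_xor_sync(0xffffffff, sum, offset, 32);
        }
        if (lane_id == 0) {
            dst[row] = norm ? sum / ncols : sum;
        }
    }
}
```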
Hide memory-latency by loop unrolling in reduce_rows_f32
c270ffe1
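The commit title names latency hiding through loop unrolling. As a hedged illustration only, the per-thread accumulation loop in the sketch above could be unrolled into independent partial sums so several global loads are in flight at once; the unroll factor of 4 is a placeholder, not necessarily what the PR uses.

```cuda
// Illustrative drop-in replacement for the per-thread column loop above:
// independent accumulators keep multiple loads outstanding, hiding memory latency.
float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
int col = threadIdx.x;
for (; col + 3 * (int) blockDim.x < ncols; col += 4 * blockDim.x) {
    s0 += x[row * ncols + col + 0 * blockDim.x];
    s1 += x[row * ncols + col + 1 * blockDim.x];
    s2 += x[row * ncols + col + 2 * blockDim.x];
    s3 += x[row * ncols + col + 3 * blockDim.x];
}
for (; col < ncols; col += blockDim.x) {  // tail loop for the remaining columns
    s0 += x[row * ncols + col];
}
float sum = (s0 + s1) + (s2 + s3);
```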
Further optimizations to `reduce_rows_f32`
ece608a1
Add perf tests for `reduce_rows_f32` kernel
9070af87
Add heuristic to toggle 128/512 threads based on sm count
80de6722
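The title suggests the block size is picked at launch time from the device's multiprocessor count. A hedged host-side sketch follows; the threshold and the launcher name are assumptions, not the values chosen in the PR.

```cuda
// Illustrative launch heuristic: prefer wide 512-thread blocks only when there
// are enough rows (blocks) to keep all SMs busy; otherwise use 128 threads.
static void launch_reduce_rows_f32(const float * x, float * dst,
                                   const int ncols, const int nrows,
                                   cudaStream_t stream) {
    int device   = 0;
    int sm_count = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, device);

    // Hypothetical cutoff: with few blocks per SM, smaller blocks tend to give
    // better occupancy and latency hiding; with many rows, wider blocks win.
    const int block_size = (nrows > 4 * sm_count) ? 512 : 128;

    const dim3 grid(nrows, 1, 1);
    const dim3 block(block_size, 1, 1);
    reduce_rows_f32<false><<<grid, block, 0, stream>>>(x, dst, ncols);
}
```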
Ensure perf gains also for small ncols and large nrows
8e04242c
Modify perf and unit-tests
8fc2c03d
Apply auto-formatting by clang
9296d1f8
ORippler force-pushed from c6ed8cc9 to 9296d1f8 64 days ago
Fix CI build failure
a6fe4dd5
Remove sm_count property from `ggml_backend_cuda_context`
4a1c5bc8
Add CUB-based implementation for GGML_OP_MEAN
7c7413ec
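`GGML_OP_MEAN` reduces each row to its average. A minimal sketch of how `cub::DeviceReduce::Sum` could back the single-row case is shown below; the temporary-buffer handling uses plain `cudaMallocAsync` as a stand-in for whatever pool allocator the backend provides, and the exact integration in the PR may differ.

```cuda
#ifdef GGML_CUDA_USE_CUB
#include <cub/cub.cuh>

// Tiny follow-up kernel to turn the device-wide sum into a mean.
static __global__ void divide_by_count(float * dst, const int ncols) {
    *dst /= ncols;
}

// Illustrative sketch: mean of a single row of `ncols` f32 values via CUB.
static void mean_f32_cub(const float * x, float * dst, const int ncols, cudaStream_t stream) {
    size_t tmp_size = 0;
    // First call with a null buffer only queries the required temp-storage size.
    cub::DeviceReduce::Sum(nullptr, tmp_size, x, dst, ncols, stream);

    void * tmp = nullptr;
    cudaMallocAsync(&tmp, tmp_size, stream);  // stand-in for the backend's pool allocator
    cub::DeviceReduce::Sum(tmp, tmp_size, x, dst, ncols, stream);
    cudaFreeAsync(tmp, stream);

    divide_by_count<<<1, 1, 0, stream>>>(dst, ncols);
}
#endif // GGML_CUDA_USE_CUB
```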
Add heuristics to execute CUB branch only when it brings perf
48cf9e43
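A sketch of the kind of dispatch this implies; the cutoff below is a placeholder, since the commit titles do not state where the CUB path actually wins.

```cuda
// Illustrative dispatch: only take the CUB path where a device-wide reduction
// is likely to beat the custom kernel (e.g. a single very long row).
static void mean_f32(const float * x, float * dst, const int ncols, const int nrows, cudaStream_t stream) {
#ifdef GGML_CUDA_USE_CUB
    if (nrows == 1 && ncols >= 65536) {  // placeholder threshold, not the PR's value
        mean_f32_cub(x, dst, ncols, stream);
        return;
    }
#endif // GGML_CUDA_USE_CUB
    const dim3 grid(nrows, 1, 1);
    const dim3 block(256, 1, 1);
    reduce_rows_f32<true><<<grid, block, 0, stream>>>(x, dst, ncols);
}
```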
Add unit-test for CUB-based mean
e8373bf6
Rename `USE_CUB` to `GGML_CUDA_USE_CUB`
0e9a5d86
Unindent Preprocessor directives
d647028a
ORippler deleted the osimons/optimize_reduce_rows_f32 branch 58 days ago
Assignees: none assigned
Labels: testing, Nvidia GPU, ggml