llama.cpp
CUDA: Optimize `rms_norm_f32` kernel and its fused variants, giving 1-6% perf E2E
#15715
Merged

ORippler
github-actions added the Nvidia GPU label
github-actions added the ggml label
ORippler Add fastdiv, use it in modulo and use modulo in rms_norm_f32
b2e98318
ORippler Support more `block_size` values in `rms_norm_f32`
bcc6c777
ORippler changed the title from "CUDA: Optimize `rms_norm_f32` kernel and its fused variants" to "CUDA: Optimize `rms_norm_f32` kernel and its fused variants, giving 1-6% perf E2E" 103 days ago
JohannesGaessler commented on 2025-09-02
ORippler Update ggml/src/ggml-cuda/common.cuh
30ab9ae4
ORippler Replace modulo with fastmodulo in `rms_norm_f32`
18242c3d
ORippler Use `BinPackArguments=true` for formatting function calls
0129866a
ORippler
JohannesGaessler commented on 2025-09-02
ORippler Update ggml/src/ggml-cuda/common.cuh
48afab4b
JohannesGaessler commented on 2025-09-03
ORippler Use uint3 for both `fastdiv` and `fastmodulo`
8b1e9370
ORippler
ORippler requested a review from JohannesGaessler 102 days ago
JohannesGaessler approved these changes on 2025-09-03
ORippler More constrained type declarations
f0dabf29
ORippler Rename fastdiv and fastmodulo variables to shared variable name
8bde72b5
JohannesGaessler
JohannesGaessler
JohannesGaessler merged 661ae31c into master 102 days ago
JohannesGaessler
ORippler Pack fastdiv/fastmodulo constants into uint2/uint3 objects
74146525
ORippler Rename function parameter of fastmodulo
0a76b118
ORippler
ORippler deleted the osimons/optimize_fused_rms_norm_f32 branch 101 days ago
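
The commit titles above mention `fastdiv` and `fastmodulo` helpers whose constants are packed into a `uint3`. As a rough illustration of that general idea (division by a fixed divisor replaced with a precomputed multiply, add, and shift), here is a minimal host-side C++ sketch. It is not the code from this PR: the `FastDivValues` struct and the `init_fastdiv_values` name are assumptions made for this example, the struct merely stands in for CUDA's `uint3`, and the real kernels in `ggml/src/ggml-cuda/common.cuh` may differ in detail.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the packed uint3 mentioned in the commit titles.
struct FastDivValues {
    uint32_t mp;  // precomputed "magic" multiplier
    uint32_t L;   // shift amount, ceil(log2(d))
    uint32_t d;   // original divisor, kept so the modulo can reuse the struct
};

// Assumed helper: precompute the constants once for a fixed divisor d >= 1.
inline FastDivValues init_fastdiv_values(uint32_t d) {
    assert(d != 0);
    uint32_t L = 0;
    while (L < 32 && (uint64_t{1} << L) < d) {
        ++L;
    }
    // mp = floor(2^32 * (2^L - d) / d) + 1, which always fits in 32 bits.
    const uint64_t mp = ((uint64_t{1} << 32) * ((uint64_t{1} << L) - d)) / d + 1;
    return FastDivValues{static_cast<uint32_t>(mp), L, d};
}

// n / d without a hardware divide.
inline uint32_t fastdiv(uint32_t n, const FastDivValues & v) {
    // High 32 bits of the 64-bit product; on a GPU this could map to __umulhi.
    const uint64_t hi = (static_cast<uint64_t>(n) * v.mp) >> 32;
    // The sum is done in 64 bits so that hi + n cannot wrap around.
    return static_cast<uint32_t>((hi + n) >> v.L);
}

// n % d, derived from the quotient with one multiply and one subtract.
inline uint32_t fastmodulo(uint32_t n, const FastDivValues & v) {
    return n - fastdiv(n, v) * v.d;
}
```

In a kernel, the constants would typically be computed once on the host and passed as a kernel argument, so each thread pays only for a multiply, an add, and a shift when mapping a flat index to row and column coordinates.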
JohannesGaessler