llama.cpp
CUDA: Optimize `rms_norm_f32` kernel and its fused variants, giving 1-6% perf E2E
#15715

Merged

JohannesGaessler merged 11 commits into ggml-org:master from ORippler:osimons/optimize_fused_rms_norm_f32

github-actions added Nvidia GPU

github-actions added ggml

Add fastdiv, use it in modulo and use modulo in rms_norm_f32

b2e98318

Support more `block_size` values in `rms_norm_f32`

bcc6c777

ORippler changed the title ~~CUDA: Optimize `rms_norm_f32` kernel and its fused variants~~ CUDA: Optimize `rms_norm_f32` kernel and its fused variants, giving 1-6% perf E2E 103 days ago

JohannesGaessler commented on 2025-09-02

Update ggml/src/ggml-cuda/common.cuh

30ab9ae4

Replace modulo with fastmodulo in `rms_norm_f32`

18242c3d

Use `BinPackArguments=true` for formating function calls

0129866a

JohannesGaessler commented on 2025-09-02

Update ggml/src/ggml-cuda/common.cuh

48afab4b

JohannesGaessler commented on 2025-09-03

Use uint3 for both `fastdiv` and `fastmodulo`

8b1e9370

ORippler requested a review from JohannesGaessler 102 days ago

JohannesGaessler approved these changes on 2025-09-03

More constrained type declarations

f0dabf29

Rename fastdiv and fastmodulo variables to shared variable name

8bde72b5

JohannesGaessler merged 661ae31c into master 102 days ago

Pack fastdiv/fastmodulo constants into uint2/uint3 objects

74146525

Rename function parameter of fastmodulo

0a76b118

ORippler deleted the osimons/optimize_fused_rms_norm_f32 branch 101 days ago
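
The commit titles above refer to `fastdiv` and `fastmodulo` helpers whose constants are packed into a `uint3`. The page does not show their code, so the following is only a host-side C++ sketch of the general technique (division by a runtime-invariant integer via a precomputed multiplier and shift), not the code merged in this PR. The names `FastDivValues`, `init_fastdiv_sketch`, `fastdiv_sketch`, and `fastmodulo_sketch` are hypothetical, and a plain struct stands in for CUDA's `uint3`. On the device, the high 32 bits of the product would typically come from the `__umulhi` intrinsic instead of a 64-bit multiply.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a packed uint3: multiplier, shift, and divisor.
struct FastDivValues {
    uint32_t mp;  // precomputed "magic" multiplier
    uint32_t L;   // shift amount, ceil(log2(d))
    uint32_t d;   // original divisor, kept for the modulo
};

// Precompute the constants once on the host for a fixed divisor d.
inline FastDivValues init_fastdiv_sketch(uint32_t d) {
    assert(d != 0);
    uint32_t L = 0;
    while (L < 32 && (uint32_t{1} << L) < d) {
        ++L;
    }
    // mp = floor(2^32 * (2^L - d) / d) + 1; the product fits in 64 bits.
    const uint64_t mp = ((uint64_t{1} << 32) * ((uint64_t{1} << L) - d)) / d + 1;
    return { static_cast<uint32_t>(mp), L, d };
}

// n / d using one multiply, one add, and one shift.
inline uint32_t fastdiv_sketch(uint32_t n, const FastDivValues & v) {
    // High 32 bits of n * mp (what __umulhi would return on the device).
    const uint64_t hi = (static_cast<uint64_t>(n) * v.mp) >> 32;
    // The sum is formed in 64 bits here so it cannot overflow for large n.
    return static_cast<uint32_t>((hi + n) >> v.L);
}

// n % d expressed through the fast division.
inline uint32_t fastmodulo_sketch(uint32_t n, const FastDivValues & v) {
    return n - fastdiv_sketch(n, v) * v.d;
}
```

The appeal of this pattern in a kernel is that the divisor (for example a tensor dimension) is constant for the whole launch, so the constants can be computed once on the host and passed as a kernel argument, replacing a hardware integer division per thread with cheaper multiply and shift instructions.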

Reviewers

JohannesGaessler

Labels

Nvidia GPU ggml
