CUDA: Optimize `rms_norm_f32` kernel and its fused variants, giving 1-6% perf E2E #15715
Add fastdiv, use it in modulo and use modulo in rms_norm_f32
b2e98318
Support more `block_size` values in `rms_norm_f32`
bcc6c777
ORippler
changed the title from "CUDA: Optimize `rms_norm_f32` kernel and its fused variants" to "CUDA: Optimize `rms_norm_f32` kernel and its fused variants, giving 1-6% perf E2E" 103 days ago
Update ggml/src/ggml-cuda/common.cuh
30ab9ae4
Replace modulo with fastmodulo in `rms_norm_f32`
18242c3d
Use `BinPackArguments=true` for formatting function calls
0129866a
Update ggml/src/ggml-cuda/common.cuh
48afab4b
Use uint3 for both `fastdiv` and `fastmodulo`
8b1e9370
More constrained type declarations
f0dabf29
Rename fastdiv and fastmodulo variables to shared variable name
8bde72b5
Pack fastdiv/fastmodulo constants into uint2/uint3 objects
74146525
Rename function parameter of fastmodulo
0a76b118
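
The commits above add `fastdiv` and `fastmodulo` helpers and pack their constants into a `uint3`, but the code itself is not shown in this timeline. As a rough illustration of the general technique (division by a loop-invariant integer via one multiply-high, one add, and one shift), here is a minimal host-side C++ sketch. It is not the PR's code: the `FastDivValues` struct is a hypothetical stand-in for CUDA's `uint3`, the function names and signatures are assumed from the commit titles, and the 64-bit multiply stands in for the `__umulhi` intrinsic a device version would likely use.

```cpp
#include <cstdint>

// Hypothetical stand-in for a CUDA uint3 holding <multiplier, shift, divisor>.
struct FastDivValues {
    uint32_t mp;  // precomputed "magic" multiplier
    uint32_t L;   // ceil(log2(d))
    uint32_t d;   // original divisor, needed only for the modulo
};

// Precompute the constants once (e.g. on the host) for a divisor d >= 1.
inline FastDivValues init_fastdiv_values(uint32_t d) {
    uint32_t L = 0;
    while (L < 32 && (uint32_t{1} << L) < d) {
        ++L;
    }
    // mp = floor(2^32 * (2^L - d) / d) + 1, which always fits in 32 bits.
    const uint64_t mp = ((uint64_t{1} << 32) * ((uint64_t{1} << L) - d)) / d + 1;
    return {static_cast<uint32_t>(mp), L, d};
}

// n / d without a hardware divide: high half of n * mp, plus n, shifted by L.
inline uint32_t fastdiv(uint32_t n, FastDivValues v) {
    // High 32 bits of the 64-bit product (what __umulhi(n, mp) computes on device).
    const uint64_t hi = (static_cast<uint64_t>(n) * v.mp) >> 32;
    // The sum is kept in 64 bits here so it cannot overflow for large n.
    return static_cast<uint32_t>((hi + n) >> v.L);
}

// n % d expressed through fastdiv: n - (n / d) * d.
inline uint32_t fastmodulo(uint32_t n, FastDivValues v) {
    return n - fastdiv(n, v) * v.d;
}
```

In a kernel such as `rms_norm_f32`, the constants would typically be computed once per launch and passed by value, so each thread pays only for the multiply, add, and shift when turning a flat index into per-dimension indices.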
ORippler
deleted the osimons/optimize_fused_rms_norm_f32 branch 101 days ago