llama.cpp
5814b4dc - cuda: optimize SOLVE_TRI using registers and FMAF (#17703)

Commit

81 days ago

cuda: optimize SOLVE_TRI using registers and FMAF (#17703)

* ggml-cuda: optimize solve_tri_f32_fast and fix stride handling

- Switch from using shared memory for the RHS/solution matrix to a register-based approach (x_low, x_high), reducing shared memory pressure and bank conflicts.
- Implement explicit `fmaf` instructions for the reduction loop.
- Update kernel arguments to pass strides in bytes rather than elements to align with standard ggml tensor arithmetic (casting to `char *` before addition).
- Remove unused `MAX_K_FAST` definition.

* Small cleanup

* Remove comments in solve_tri.cu

* Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Use const for variables in solve_tri.cu

* Replace fmaf with more readable code

* remove last fmaf

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
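The byte-stride convention mentioned in the message (cast to `char *`, then add the offset) can be illustrated with a small CPU-side sketch. This is not the kernel from the commit: `load_f32` and `solve_tri_lower_ref` are hypothetical names, and the sketch assumes a lower-triangular, non-unit-diagonal A solved from the left (A * X = B) by plain forward substitution, with no registers, warps, or shared memory involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Read element (row, col) of a float matrix whose strides are given in BYTES
// (in the style of ggml's nb[] fields): cast to char* first, then add offsets.
// Hypothetical helper, not taken from solve_tri.cu.
static inline float load_f32(const void * base, size_t row, size_t col,
                             size_t nb_row, size_t nb_col) {
    const char * p = static_cast<const char *>(base) + row * nb_row + col * nb_col;
    return *reinterpret_cast<const float *>(p);
}

// Hypothetical CPU reference: solve A * X = B for lower-triangular A (n x n)
// and B (n x k) by forward substitution. Returns X as dense row-major n x k.
// Assumes every diagonal entry of A is non-zero.
std::vector<float> solve_tri_lower_ref(const void * A, size_t nbA_row, size_t nbA_col,
                                       const void * B, size_t nbB_row, size_t nbB_col,
                                       size_t n, size_t k) {
    assert(A != nullptr && B != nullptr);
    std::vector<float> X(n * k);
    for (size_t col = 0; col < k; ++col) {
        for (size_t row = 0; row < n; ++row) {
            // Accumulate the contribution of the already-solved rows.
            float sum = 0.0f;
            for (size_t j = 0; j < row; ++j) {
                sum += load_f32(A, row, j, nbA_row, nbA_col) * X[j * k + col];
            }
            const float diag = load_f32(A, row, row, nbA_row, nbA_col);
            X[row * k + col] = (load_f32(B, row, col, nbB_row, nbB_col) - sum) / diag;
        }
    }
    return X;
}
```

For a contiguous row-major `float` matrix with `n` columns, the row stride would be `n * sizeof(float)` and the column stride `sizeof(float)`. Passing strides in bytes lets the same addressing code work for non-contiguous views, since the caller never has to divide by the element size.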

References

#17703 - cuda: optimize SOLVE_TRI using registers and FMAF

Author

wsbagnsv1

Parents

79d61896
