llama.cpp
8b11deea - Hide latency of bias and gate-loading (#16847)

Commit

52 days ago

Hide latency of bias and gate-loading (#16847)

The bias and gate values are now loaded into registers before the dot product is computed, which effectively batches those loads with the dot product. Because many threads are live at this point, the warp scheduler has enough other threads available to hide the cost of loading those two additional floats.
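The reordering described in the commit message can be sketched as follows. This is a host-side C++ analog, not the kernel from the commit: the function names, the per-row `bias` and `gate` arrays, and the fusion formula `(dot + bias) * gate` are all assumptions made for illustration. Both variants return the same value; on a GPU, the point of the second ordering is that the two loads are already in flight while the dot product runs, so other warps can cover their latency.

```cpp
#include <cstddef>

// Hypothetical fused per-row operation: out = (dot(w_row, x) + bias[row]) * gate[row].
// The formula and all names are illustrative assumptions, not llama.cpp's API.

// "Before" ordering: bias and gate are read only after the dot product is done,
// so on a GPU the thread would stall on these loads at the very end.
float fused_row_late_load(const float* w_row, const float* x, std::size_t n,
                          const float* bias, const float* gate, std::size_t row) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc += w_row[i] * x[i];
    }
    return (acc + bias[row]) * gate[row];
}

// "After" ordering: bias and gate are read into locals (registers on a GPU)
// before the dot product, so their loads overlap with the dot-product work.
float fused_row_early_load(const float* w_row, const float* x, std::size_t n,
                           const float* bias, const float* gate, std::size_t row) {
    const float b = bias[row];  // issued up front
    const float g = gate[row];  // issued up front
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc += w_row[i] * x[i];
    }
    return (acc + b) * g;
}
```

On a CPU the two functions behave the same and a compiler may well emit identical code for them. The sketch only shows where the loads sit relative to the dot product, which is the change the commit message describes.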

References

#16847 - CUDA: Hide latency of bias and gate-loading for fused `mul_mat_vec_q`

Author

ORippler

Parents

b9ce9401
