llama.cpp
be0a0f8c - vulkan: Implement grouped query attention in the coopmat2 FA shader (#12559)

Commit

278 days ago

vulkan: Implement grouped query attention in the coopmat2 FA shader (#12559)

When adjacent batches of Q share the same batches of K/V, batch them into the same workgroup. For example, when:

dst(128,32,1,1) = FA(q(128,1,32,1), k(128,16640,8,1), v(128,16640,8,1))

previously we would run 32 workgroups computing 1 result each, now we will run 8 workgroups computing 4 results each.

This doesn't directly translate to better performance (at least when you have >=32 SMs), but in a subsequent change I'll enable split_k which will scale much better with 4x fewer workgroups.
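The workgroup arithmetic in the example above can be sketched as follows. This is an illustrative model only, not code from the shader or the Vulkan backend: the function names `fa_workgroups` and `kv_head_for` are hypothetical, and the sketch assumes a single query row per head (as in the example shapes) and that the number of Q heads is an exact multiple of the number of K/V heads.

```python
def fa_workgroups(n_q_heads: int, n_kv_heads: int, grouped: bool) -> tuple[int, int]:
    """Return (workgroup_count, results_per_workgroup) along the head dimension.

    Hypothetical helper: models only the head dimension and ignores any
    further tiling the real dispatch may apply.
    """
    if n_kv_heads <= 0 or n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a positive multiple of n_kv_heads")
    gqa_ratio = n_q_heads // n_kv_heads  # Q heads that share one K/V head
    if grouped:
        # One workgroup per K/V head, each producing gqa_ratio results.
        return n_kv_heads, gqa_ratio
    # One workgroup per Q head, each producing a single result.
    return n_q_heads, 1


def kv_head_for(q_head: int, gqa_ratio: int) -> int:
    """Adjacent Q heads map to the same K/V head (assumed layout)."""
    return q_head // gqa_ratio


# Shapes from the commit message: q has 32 heads, k/v have 8 heads.
before = fa_workgroups(32, 8, grouped=False)
after = fa_workgroups(32, 8, grouped=True)
print(before, after)
```

With the shapes from the commit message, the ungrouped case gives 32 workgroups with 1 result each and the grouped case gives 8 workgroups with 4 results each, which is the 4x reduction in workgroup count the message refers to.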

References

#12559 - vulkan: Implement grouped query attention in the coopmat2 FA shader

Author

jeffbolznv

Parents

92e3006b
