llama.cpp
f01bd023 - vulkan: Implement split_k for coopmat2 flash attention. (#12627)

Commit

When using group query attention, we have one workgroup per KV batch, and this can be very few workgroups (e.g. just 8 in some models). Enable split_k to spread the work across SMs. This helps a lot when the KV cache is large.
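To make the motivation concrete: with group query attention a model may launch only a handful of KV workgroups (the message cites 8), far fewer than the SMs on a modern GPU, so most of the chip sits idle. Below is a rough sketch of how a split factor could be picked to cover the SMs; this is an assumed heuristic for illustration, not the commit's actual selection logic, and `choose_split_k` is a hypothetical name.

```cpp
// Hypothetical heuristic: grow split_k until there are at least as many
// workgroups as SMs, so the split work can spread across the whole GPU.
#include <cstdint>

uint32_t choose_split_k(uint32_t num_workgroups, uint32_t sm_count) {
    uint32_t split_k = 1;
    while (num_workgroups * split_k < sm_count) {
        split_k *= 2;  // e.g. 8 workgroups on an 80-SM GPU -> split_k = 16
    }
    return split_k;
}
```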
Files changed
  • ggml/src/ggml-vulkan/ggml-vulkan.cpp
  • ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_cm2.comp
  • ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_split_k_reduce.comp
  • ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp
  • tests/test-backend-ops.cpp
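Splitting the KV range means each split computes its softmax against a local maximum, so a final pass has to merge the partial results; that is the role of the new flash_attn_split_k_reduce.comp shader in the list above. The host-side C++ sketch below shows only the standard numerically stable merge math; the variable names and data layout are assumptions for illustration, not the shader's actual interface.

```cpp
// A minimal sketch (not the actual shader) of merging split_k partial
// flash-attention results for one output row. Assumes each split produced:
//   m[i] : the local row maximum used for that split's softmax
//   l[i] : the local sum of exp(score - m[i])
//   O[i] : the unnormalized partial output, sum of exp(score - m[i]) * V
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<float> reduce_split_k(const std::vector<float>              &m,  // [split_k]
                                  const std::vector<float>              &l,  // [split_k]
                                  const std::vector<std::vector<float>> &O)  // [split_k][head_dim]
{
    const std::size_t split_k  = m.size();
    const std::size_t head_dim = O[0].size();

    // Global maximum across all splits; rescaling by exp(m[i] - m_max)
    // brings every partial result onto a common softmax base.
    float m_max = m[0];
    for (std::size_t i = 1; i < split_k; ++i) {
        m_max = std::max(m_max, m[i]);
    }

    float L = 0.0f;                          // combined softmax denominator
    std::vector<float> out(head_dim, 0.0f);  // combined, still unnormalized, output
    for (std::size_t i = 0; i < split_k; ++i) {
        const float scale = std::exp(m[i] - m_max);
        L += l[i] * scale;
        for (std::size_t d = 0; d < head_dim; ++d) {
            out[d] += O[i][d] * scale;
        }
    }
    for (std::size_t d = 0; d < head_dim; ++d) {
        out[d] /= L;  // final normalization
    }
    return out;
}
```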