llama.cpp
80b717d4 - vulkan: Use unclamped loads for flash attention mask (#12720)

Commit message (138 days ago):

vulkan: Use unclamped loads for flash attention mask (#12720)

nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple of the number of rows in the matrix. The KV dim is a multiple of the number of columns for the aligned shader.
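The commit message states the invariant that justifies dropping the bounds clamp: the mask's padded row count (nem1) and the KV dimension are both multiples of the tile size, so a full tile load can never run past the end of the mask buffer. The GLSL compute sketch below is illustrative only, not the actual flash_attn_cm2.comp code; the names Br, Bc, kv_len, mask_tile, the float element type, and the workgroup layout are assumptions chosen to show the idea.

```glsl
#version 450
// Hypothetical sketch (not the real flash_attn_cm2.comp code) of why the
// KQ-mask tile load can skip per-element clamping:
//   - ggml pads the mask's row count (nem1) to a multiple of GGML_KQ_MASK_PAD,
//   - GGML_KQ_MASK_PAD is a multiple of the tile row count (Br below),
//   - in the aligned shader the KV dimension is a multiple of the tile column
//     count (Bc below),
// so a full Br x Bc tile at an aligned (row, col) offset is always in bounds.

const uint Br = 16;   // tile rows   (illustrative value)
const uint Bc = 16;   // tile cols   (illustrative value)

layout(binding = 0, std430) readonly buffer Mask { float mask_data[]; };
layout(push_constant) uniform PC { uint kv_len; };   // multiple of Bc on the aligned path

shared float mask_tile[Br * Bc];

layout(local_size_x = 64) in;

void main() {
    uint row0 = gl_WorkGroupID.y * Br;   // in bounds: nem1 is a multiple of Br
    uint col0 = gl_WorkGroupID.x * Bc;   // in bounds: kv_len is a multiple of Bc

    // Unclamped loads: no per-element "row < nem1 && col < kv_len" check,
    // because the padding guarantees the whole tile lies inside the buffer.
    for (uint i = gl_LocalInvocationIndex; i < Br * Bc; i += gl_WorkGroupSize.x) {
        uint r = i / Bc;
        uint c = i % Bc;
        mask_tile[i] = mask_data[(row0 + r) * kv_len + (col0 + c)];
    }
}
```

The payoff, per the commit message, is that the mask load needs no bounds checks at all; that is safe only because of the padding guarantees spelled out above.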
Files changed:
  • ggml/src/ggml-vulkan/ggml-vulkan.cpp
  • ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_cm2.comp