whisper.cpp
fadb3233 - vulkan: optimize flash attention split_k_reduce (llama/14554)

* vulkan: allow FA split_k with smaller KV values
* vulkan: spread split_k_reduce work across more threads

k_num can get rather large. Use the whole workgroup to reduce the M/L values. Launch a thread for each element in the HSV dimension of the output. Helps a lot for large HSV (like DeepSeek).
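For context, the split_k reduce combines the partial results of each KV split using the standard log-sum-exp merge: pick the global max M across splits, rescale each split's L and output accumulator by exp(M_i - M), sum, and normalize. Below is a minimal CPU-side sketch of that math, not the actual Vulkan shader; the names, data layout, and function signature are illustrative assumptions. Comments note which steps the commit spreads across the workgroup.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Illustrative split-k reduce for flash attention (not the real shader).
// For one output row, each of k_num splits holds:
//   M[i]            - running max of the attention logits in split i
//   L[i]            - sum of exp(logit - M[i]) over split i's KV slice
//   O[i][0..hsv-1]  - unnormalized output accumulator of split i
// The commit parallelizes this on the GPU: the whole workgroup cooperates
// on the M/L reduction (k_num can get rather large), and one thread is
// launched per element of the HSV dimension for the final scale-and-sum.
void split_k_reduce(const std::vector<float>& M,
                    const std::vector<float>& L,
                    const std::vector<std::vector<float>>& O,
                    std::vector<float>& out, // caller-sized to hsv
                    int k_num, int hsv) {
    // 1) Global max over all splits (workgroup-wide reduction on the GPU).
    float m = -INFINITY;
    for (int i = 0; i < k_num; ++i) m = std::max(m, M[i]);

    // 2) Rescale and sum the per-split denominators.
    float l = 0.0f;
    for (int i = 0; i < k_num; ++i) l += L[i] * std::exp(M[i] - m);

    // 3) Rescale and sum the per-split outputs, then normalize.
    //    On the GPU, each HSV element d is handled by its own thread.
    for (int d = 0; d < hsv; ++d) {
        float acc = 0.0f;
        for (int i = 0; i < k_num; ++i) acc += O[i][d] * std::exp(M[i] - m);
        out[d] = acc / l;
    }
}
```

With this structure, a large HSV (e.g. DeepSeek-style heads) no longer serializes the reduction through a handful of threads, which is where the speedup comes from.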