llama.cpp
c446b2ed - vulkan: Submit once enough matmul work has been recorded (#12406)

Commit

267 days ago

vulkan: Submit once enough matmul work has been recorded (#12406)

I've been seeing significantly worse token generation (tg) performance with flash attention (FA) enabled than with it disabled, and it seems to be related to the submit heuristic. Change the heuristic to check how many bytes' worth of weight matrix are used and flush every 100MB, and ramp up after the first few submits. This seems to resolve the issue, and also increases perf for non-FA a bit.
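The heuristic described above can be sketched roughly as follows. This is an illustration only, not the code from the commit: `Node`, `plan_submits`, and `ramp_submits` are hypothetical names, and the "ramp up" is assumed here to mean doubling the byte threshold after each of the first few submits, which the commit message does not spell out.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a graph node; not a ggml type.
struct Node {
    bool     is_matmul;     // true if the node is a matrix multiplication
    uint64_t weight_bytes;  // size of the weight matrix it reads
};

// Returns the indices of the nodes after which a submit would be issued.
// Assumption: the threshold starts at 100MB and doubles after each of the
// first `ramp_submits` submits.
std::vector<std::size_t> plan_submits(const std::vector<Node>& nodes,
                                      uint64_t bytes_per_submit = 100ull * 1000 * 1000,
                                      int ramp_submits = 3) {
    std::vector<std::size_t> submit_points;
    uint64_t pending_bytes = 0;  // matmul weight bytes recorded since the last submit
    int submit_count = 0;

    for (std::size_t i = 0; i < nodes.size(); ++i) {
        if (nodes[i].is_matmul) {
            pending_bytes += nodes[i].weight_bytes;
        }
        const bool last_node = (i + 1 == nodes.size());
        if (pending_bytes >= bytes_per_submit || last_node) {
            submit_points.push_back(i);
            pending_bytes = 0;
            if (submit_count < ramp_submits) {
                bytes_per_submit *= 2;  // ramp up: later batches carry more work
            }
            ++submit_count;
        }
    }
    return submit_points;
}
```

With small early batches the GPU gets work quickly, while the growing threshold keeps the number of submits low over the rest of the graph.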

References

#12406 - vulkan: Submit once enough matmul work has been recorded

Author

jeffbolznv

Parents

d84635b1
