llama.cpp
0ccbfdef - hexagon: further optimizations and refactoring for flash attention (#19583)

Commit
7 days ago
hexagon: further optimizations and refactoring for flash attention (#19583)

* ggml-hexagon: fa improvements
  ggml-hexagon: optimize flash attention calculations with improved variable handling
  ggml-hexagon: streamline flash attention operations by removing redundant checks for FP32
  ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx2 by simplifying variable handling for unused elements
  ggml-hexagon: optimize flash attention by changing the slope vector type to F16
* hexfa: fixed test-backend-ops failures due to leftover element handling
* hexagon: refactor and optimize fa to use a local context struct
* ggml-hexagon: optimize flash attention using hvx_vec_expf; use HVX for the online softmax

---------

Co-authored-by: chraac <chraac@gmail.com>