llama.cpp
bcb43163 - ggml-cpu: Use tiled FA for prompt-processing (#19012)

Commit
32 days ago
ggml-cpu: Use tiled FA for prompt-processing (#19012)

* ggml-cpu: Use tiled FA for prompt-processing

  FA performance on CPU is poor on long contexts because it essentially uses a vector kernel. This PR adds a tiled FA for prompt processing (PP). Tile sizes were tuned on a single-socket, 64-core AMD EPYC machine.

* fix out of bounds for mask
* skip rows where there are all masks
* skip tile if mask is inf
* store mask in worksize
* check inf tile earlier
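
The idea described in the commit message can be illustrated with a minimal sketch. This is not the code from the PR and does not use the ggml API: the function names, the row-major float layout, the single attention head, and the tile sizes are all assumptions made for illustration. It shows a row-at-a-time reference loop next to a tiled loop that walks Q and KV in blocks, keeps a running max and running sum per query row (online softmax), and skips any tile whose mask entries are all -inf. A real kernel would presumably compute each tile with vectorized block operations; the scalar inner loops here only show the control flow.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

static const float NEG_INF = -std::numeric_limits<float>::infinity();

// Reference (hypothetical helper): one full softmax per query row.
// q: n_q x d, k and v: n_kv x d, mask: n_q x n_kv holding 0 or -inf, all row-major.
std::vector<float> naive_attention(const std::vector<float>& q, const std::vector<float>& k,
                                   const std::vector<float>& v, const std::vector<float>& mask,
                                   int n_q, int n_kv, int d) {
    const float scale = 1.0f / std::sqrt((float) d);
    std::vector<float> out(n_q * d, 0.0f), s(n_kv);
    for (int i = 0; i < n_q; ++i) {
        float mx = NEG_INF;
        for (int j = 0; j < n_kv; ++j) {
            float dot = 0.0f;
            for (int c = 0; c < d; ++c) dot += q[i*d + c] * k[j*d + c];
            s[j] = dot * scale + mask[i*n_kv + j];
            mx = std::max(mx, s[j]);
        }
        if (mx == NEG_INF) continue; // fully masked row: output stays zero
        float sum = 0.0f;
        for (int j = 0; j < n_kv; ++j) {
            const float p = std::exp(s[j] - mx);
            sum += p;
            for (int c = 0; c < d; ++c) out[i*d + c] += p * v[j*d + c];
        }
        for (int c = 0; c < d; ++c) out[i*d + c] /= sum;
    }
    return out;
}

// Tiled variant (hypothetical helper): iterate over Q tiles and KV tiles,
// skip fully masked tiles, and merge partial results with an online softmax.
std::vector<float> tiled_attention(const std::vector<float>& q, const std::vector<float>& k,
                                   const std::vector<float>& v, const std::vector<float>& mask,
                                   int n_q, int n_kv, int d, int tile_q, int tile_kv) {
    assert(tile_q > 0 && tile_kv > 0);
    const float scale = 1.0f / std::sqrt((float) d);
    std::vector<float> out(n_q * d, 0.0f);
    std::vector<float> m(n_q, NEG_INF), l(n_q, 0.0f); // running max and running sum per row
    std::vector<float> s(tile_kv);
    for (int q0 = 0; q0 < n_q; q0 += tile_q) {
        const int q1 = std::min(q0 + tile_q, n_q);
        for (int k0 = 0; k0 < n_kv; k0 += tile_kv) {
            const int k1 = std::min(k0 + tile_kv, n_kv);
            // Check the mask first so a fully masked tile costs no dot products.
            bool all_masked = true;
            for (int i = q0; i < q1 && all_masked; ++i)
                for (int j = k0; j < k1; ++j)
                    if (mask[i*n_kv + j] != NEG_INF) { all_masked = false; break; }
            if (all_masked) continue;
            for (int i = q0; i < q1; ++i) {
                float tile_max = NEG_INF;
                for (int j = k0; j < k1; ++j) {
                    float dot = 0.0f;
                    for (int c = 0; c < d; ++c) dot += q[i*d + c] * k[j*d + c];
                    s[j - k0] = dot * scale + mask[i*n_kv + j];
                    tile_max = std::max(tile_max, s[j - k0]);
                }
                if (tile_max == NEG_INF) continue; // this row is fully masked within the tile
                const float m_new = std::max(m[i], tile_max);
                const float corr = std::exp(m[i] - m_new); // rescales earlier tiles; 0 on first hit
                l[i] *= corr;
                for (int c = 0; c < d; ++c) out[i*d + c] *= corr;
                for (int j = k0; j < k1; ++j) {
                    const float p = std::exp(s[j - k0] - m_new);
                    l[i] += p;
                    for (int c = 0; c < d; ++c) out[i*d + c] += p * v[j*d + c];
                }
                m[i] = m_new;
            }
        }
    }
    for (int i = 0; i < n_q; ++i)
        if (l[i] > 0.0f)
            for (int c = 0; c < d; ++c) out[i*d + c] /= l[i];
    return out;
}
```

With a causal mask, every tile that lies entirely above the diagonal is skipped, which is where the early mask check saves work. For example, calling tiled_attention(q, k, v, mask, 6, 6, 4, 2, 3) on a 6 x 6 causal mask should match naive_attention on the same inputs up to floating-point rounding.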