llama.cpp
15f786e6 - [CUDA ] Write an optimized flash_attn_stream_k_fixup kernel (#21159)

Commit
9 days ago
[CUDA ] Write an optimized flash_attn_stream_k_fixup kernel (#21159)

* Write an optimized flash_attn_stream_k_fixup kernel

  Write a specialized, more optimized kernel for cases where nblocks_stream_k is a multiple of ntiles_dst. Round nblocks_stream_k to a multiple of ntiles_dst if nblocks_stream_k > 2 * ntiles_dst.

* Use the new kernel only for nblocks_stream_k_raw > 4 * ntiles_dst to make sure we have enough concurrency on GPUs
* Address review comments
* Address review comments
* Revert variable names to original
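The block-count selection described in the commit message can be sketched on the host side roughly as follows. This is an illustration only, not the code from the commit: `StreamKPlan` and `plan_stream_k` are hypothetical names, the sketch uses the `4 * ntiles_dst` threshold from the later bullet, and it assumes the rounding goes down to the nearest multiple (the message does not state the direction).

```cpp
#include <cassert>

// Hypothetical helper for illustration; not the llama.cpp implementation.
struct StreamKPlan {
    int  nblocks_stream_k;     // number of stream-k blocks to launch
    bool use_optimized_fixup;  // true when the block count is a multiple of ntiles_dst
};

// nblocks_stream_k_raw: block count before any adjustment.
// ntiles_dst:           number of destination tiles.
StreamKPlan plan_stream_k(int nblocks_stream_k_raw, int ntiles_dst) {
    assert(nblocks_stream_k_raw > 0 && ntiles_dst > 0);

    // Only take the specialized path when there are clearly more blocks than
    // tiles, so that dropping a few blocks does not starve the GPU of work.
    if (nblocks_stream_k_raw > 4 * ntiles_dst) {
        // Assumption: round DOWN to the nearest multiple of ntiles_dst, so
        // every destination tile is covered by the same number of blocks.
        const int rounded = (nblocks_stream_k_raw / ntiles_dst) * ntiles_dst;
        return {rounded, true};
    }

    // Otherwise keep the raw count and fall back to the general fixup kernel.
    return {nblocks_stream_k_raw, false};
}
```

For example, under these assumptions `plan_stream_k(132, 8)` selects 128 blocks with the optimized fixup, while `plan_stream_k(32, 8)` keeps all 32 blocks and uses the general path because 32 is not greater than 4 * 8.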