llama.cpp
15f786e6 - [CUDA ] Write an optimized flash_attn_stream_k_fixup kernel (#21159)

Commit

9 days ago

[CUDA ] Write an optimized flash_attn_stream_k_fixup kernel (#21159)

* Write an optimized flash_attn_stream_k_fixup kernel

  Write a specialized, more optimized kernel for cases where nblocks_stream_k is a multiple of ntiles_dst. Make nblocks_stream_k a multiple of ntiles_dst if nblocks_stream_k > 2 * ntiles_dst.
* Use the new kernel only for nblocks_stream_k_raw > 4 * ntiles_dst to make sure we have enough concurrency on GPUs
* Address review comments
* Address review comments
* Revert variable names to original
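The block-count rule described in the commit message can be sketched on the host side as below. This is an illustrative sketch only, not the code from the commit: `stream_k_plan` and `plan_stream_k` are hypothetical names, the 4x threshold is taken from the second bullet of the message, and rounding *down* to the nearest multiple is an assumption (the message does not state the rounding direction).

```cpp
#include <cassert>

// Hypothetical plan describing a stream-k launch; not a llama.cpp type.
struct stream_k_plan {
    int  nblocks_stream_k;     // number of stream-k blocks to launch
    bool use_optimized_fixup;  // whether the specialized fixup kernel applies
};

// Sketch of the selection rule from the commit message.
// Assumption: the block count is rounded DOWN to a multiple of ntiles_dst.
static stream_k_plan plan_stream_k(int nblocks_stream_k_raw, int ntiles_dst) {
    assert(nblocks_stream_k_raw > 0 && ntiles_dst > 0);

    // Default: keep the raw block count and use the generic fixup path.
    stream_k_plan plan = {nblocks_stream_k_raw, false};

    // Only switch when there are more than 4 blocks per destination tile,
    // so that rounding does not cost too much GPU concurrency.
    if (nblocks_stream_k_raw > 4 * ntiles_dst) {
        plan.nblocks_stream_k    = (nblocks_stream_k_raw / ntiles_dst) * ntiles_dst;
        plan.use_optimized_fixup = true;
    }
    return plan;
}
```

Under these assumptions, `plan_stream_k(132, 8)` yields 128 blocks (16 per destination tile) with the optimized fixup enabled, while `plan_stream_k(30, 8)` leaves the count at 30 and keeps the generic path.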

References

#21159 - [CUDA ] Write an optimized flash_attn_stream_k_fixup kernel

Author

gaugarg-nv

Parents

94ca829b
