llama.cpp
f6b533d8 - Vulkan Flash Attention Coopmat1 Refactor (#19075)

Commit

161 days ago

Vulkan Flash Attention Coopmat1 Refactor (#19075)

* vulkan: use coopmat for flash attention p*v matrix multiplication
* fix P loading issue
* fix barrier position
* remove reduction that is no longer needed
* move max thread reduction into loop
* remove osh padding
* add bounds checks and padding
* remove unused code
* fix shmem sizes, loop duration and accesses
* don't overwrite Qf, add new shared psh buffer instead
* add missing bounds checks
* use subgroup reductions
* optimize
* move bounds check, reduce barriers
* support other Bc values and other subgroup sizes
* remove D_split
* replace Of register array with shared memory Ofsh array
* parallelize HSV across the rowgroups
* go back to Of in registers, not shmem
* vectorize sfsh
* don't store entire K tile in shmem
* fixes
* load large k tiles to shmem on Nvidia
* adapt shared memory host check function to shader changes
* remove Bc 32 case
* remove unused variable
* fix missing mask reduction tmspsh barrier
* fix mask bounds check
* fix rowmax f16 under/overflow to inf
* fix flash_attn_cm2 BLOCK_SIZE preprocessor directives
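For readers unfamiliar with the terms in the commit message (row max, P, the P*V product, Bc), the following is a minimal CPU sketch of the general flash attention recurrence for a single query row. It is not the shader code from this commit: the function names, the `block_cols` parameter (loosely analogous to a Bc tile width), and the plain Python lists are all assumptions made for illustration. It only shows why a running row max and a rescaling step are needed when K and V are processed tile by tile.

```python
import math


def naive_attention_row(q, K, V, scale):
    """Reference: softmax(scale * q.K^T) . V for one query row, all at once."""
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    p = [math.exp(s - m) for s in scores]
    denom = sum(p)
    dim_v = len(V[0])
    return [sum(p[j] * V[j][d] for j in range(len(V))) / denom
            for d in range(dim_v)]


def blockwise_attention_row(q, K, V, scale, block_cols):
    """Hypothetical sketch: same result, computed one K/V tile at a time."""
    dim_v = len(V[0])
    row_max = -math.inf      # running maximum of the scores seen so far
    row_sum = 0.0            # running softmax denominator
    out = [0.0] * dim_v      # running, unnormalized P*V accumulator

    for start in range(0, len(K), block_cols):
        k_tile = K[start:start + block_cols]
        v_tile = V[start:start + block_cols]

        # S = scale * q . K_tile^T
        s = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]

        # Update the running max and rescale what was accumulated so far.
        new_max = max(row_max, max(s))
        correction = math.exp(row_max - new_max)  # 0.0 on the first tile

        # P = exp(S - new_max), then accumulate P * V_tile.
        p = [math.exp(x - new_max) for x in s]
        row_sum = row_sum * correction + sum(p)
        out = [o * correction + sum(p[j] * v_tile[j][d] for j in range(len(p)))
               for d, o in enumerate(out)]
        row_max = new_max

    return [o / row_sum for o in out]
```

In a GPU implementation the per-tile score and P*V products are the parts that map onto matrix hardware, and the max and sum become cross-thread reductions; this sketch keeps everything scalar so the recurrence itself is easy to follow.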

References

#19075 - Vulkan Flash Attention Coopmat1 Refactor

Author

0cc4m

Parents

72d3b189
