vulkan: Flash Attention DP4A shader for quantized KV cache (#20797)
* use integer dot product for quantized KV flash attention
* small improvements
* fix SHMEM_STAGING indexing
* add missing KV type quants
* fixes
* add supported quants to FA tests
* re-add fast paths for sub-8-bit quants
* fix mmq gate and shmem checks