ggml-cuda: native bf16 flash attention for vec kernel (#20525)
* ggml-cuda: native bf16 flash attention for vec and tile kernels
The mma kernel still converts bf16 to fp16 before launch; native bf16 support in the mma kernel is a TODO.
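Why the bf16→fp16 conversion in the mma path is lossy (and why a native bf16 path matters) can be sketched at the bit level. This is an illustrative host-side sketch, not the ggml-cuda code: bf16 is simply the top 16 bits of an fp32, while fp16 has a narrower 5-bit exponent, so narrowing bf16 to fp16 can overflow to infinity.

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* bf16 is the top 16 bits of an fp32, so widening is just a shift. */
static float bf16_to_fp32(uint16_t b) {
    uint32_t u = (uint32_t)b << 16;
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Minimal fp32 -> fp16 narrowing (truncating, subnormals flushed to zero):
 * fp16 has 5 exponent bits vs. the 8 of bf16/fp32, so bf16 values with
 * large magnitude overflow to infinity when converted this way. */
static uint16_t fp32_to_fp16(float f) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);
    uint32_t sign = (x >> 16) & 0x8000;
    int32_t  exp  = (int32_t)((x >> 23) & 0xff) - 127 + 15;
    uint32_t mant = x & 0x7fffff;
    if (exp <= 0)  return (uint16_t)sign;            /* underflow -> signed zero */
    if (exp >= 31) return (uint16_t)(sign | 0x7c00); /* overflow  -> infinity   */
    return (uint16_t)(sign | ((uint32_t)exp << 10) | (mant >> 13));
}
```

Values that round-trip within fp16 range (e.g. bf16 0x3F80 = 1.0) survive the cast, but a bf16 like 0x501A (~1e10) becomes fp16 infinity, which is the class of precision/range issue a native bf16 kernel avoids.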
* ggml-cuda: address code owner review feedback
Reverted the tile kernel changes to avoid a larger refactor.
* fix CI failures on Turing and HIP
* fix bf16 vec kernel compilation on HIP v_dot2 platforms
* add comments
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>