onnxruntime
d1857d12 - Enable BF16 Cutlass FMHA (#26894)

Commit
28 days ago
Enable BF16 Cutlass FMHA (#26894)

Enables BF16 support for Cutlass FMHA in GroupQueryAttention and MultiHeadAttention operators. Includes updates to:

- CUDA kernels for BF16 FMHA.
- GroupQueryAttention and PackedMultiHeadAttention implementations.
- IO Binding Helper for BF16 models.
- Extensive test updates in `test_gqa.py`, including added BF16 test cases and fewer combinations to speed up the tests.