Enable BF16 Cutlass FMHA (#26894)

Commit

28 days ago

Enable BF16 Cutlass FMHA (#26894)

Enables BF16 support for Cutlass FMHA in the GroupQueryAttention and MultiHeadAttention operators.

Includes updates to:
- CUDA kernels for BF16 FMHA.
- GroupQueryAttention and PackedMultiHeadAttention implementations.
- The IO Binding Helper, to support BF16 models.
- Extensive test updates in `test_gqa.py`, including added BF16 test cases and fewer combinations to speed up the tests.
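For readers unfamiliar with the format, BF16 keeps the sign and 8 exponent bits of FP32 but only 7 explicit mantissa bits, which is why BF16 test cases generally need looser tolerances than FP32 ones. The sketch below is not taken from this commit; it uses only the Python standard library, and the helper names `float_to_bf16_bits` and `bf16_bits_to_float` are hypothetical. It assumes round-to-nearest-even conversion, which is the common choice but may differ from what a given kernel does.

```python
import struct


def float_to_bf16_bits(x: float) -> int:
    """Round x (via FP32) to the nearest BF16 value; return its 16-bit pattern.

    Illustrative helper, not part of onnxruntime. Raises OverflowError if x
    is finite but does not fit in FP32.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    is_nan = (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0
    if is_nan:
        # Keep NaN a NaN: set a mantissa bit that survives truncation.
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Round to nearest, ties to even: the bias depends on the lowest kept bit.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(b: int) -> float:
    """Expand a 16-bit BF16 pattern back to a Python float (exact)."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x


if __name__ == "__main__":
    value = 3.14159
    roundtrip = bf16_bits_to_float(float_to_bf16_bits(value))
    # The relative error of a single BF16 rounding is at most 2**-9.
    print(value, roundtrip, abs(roundtrip - value) / value)
```

With 7 mantissa bits, a single rounding already loses everything past roughly the third significant decimal digit, so comparisons against an FP32 reference usually need a relative tolerance on the order of 1e-2 rather than 1e-5.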

References

#26894 - Enable BF16 Cutlass FMHA

Author

tianleiwu

Parents

07ec4515

onnxruntime d1857d12 - Enable BF16 Cutlass FMHA (#26894)