onnxruntime
4adef01e - [CUDA] Update sm check for flash attention (#24584)

[CUDA] Update sm check for flash attention (#24584)

### Description
Currently, flash attention is only enabled for sm8x and sm90, which means Blackwell GPUs will not use flash attention. This change enables flash attention for sm > 90 (see the sketch after this message). Note that the flash attention implementation is not optimized for Blackwell, but it should be able to run on Blackwell GPUs.

Future work:
* Integrate flash attention for Hopper: https://github.com/Dao-AILab/flash-attention/tree/main/hopper
* Integrate FMHA for Blackwell: https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha
* Update cuDNN and the cuDNN frontend to the latest version (so that we can use the cuDNN flash attention for Blackwell).

### Motivation and Context
ORT GenAI is slow on RTX 5090.
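
A minimal sketch of the relaxed SM gate described above. The function and variable names here are hypothetical and do not reflect the actual ONNX Runtime source; it only illustrates the change from allowlisting sm8x/sm90 to accepting any compute capability of 8.0 or newer.

```cpp
// Hypothetical helper: decide whether to dispatch to the flash attention kernel
// based on the device's compute capability (sm_major.sm_minor).
bool UseFlashAttention(int sm_major, int sm_minor) {
  const int sm = sm_major * 10 + sm_minor;

  // Before this change (assumed): only Ampere/Ada (sm 80, 86, 89) and
  // Hopper (sm 90) were allowed, so Blackwell (sm >= 100) fell through
  // to a slower attention path.
  // return (sm >= 80 && sm <= 90);

  // After this change (assumed): any architecture at sm 80 or newer,
  // including Blackwell, runs the existing (unoptimized for Blackwell)
  // flash attention implementation.
  return sm >= 80;
}
```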