onnxruntime
GQA Flash Attention with Attention Mask (#18283)
Commit 3dece27f · 2 years ago
### Description
GQA now supports an attention mask input via the Flash Attention kernel, allowing batched input with sequences of different lengths. Note: this PR disables Memory Efficient Attention, so only the Flash Attention kernel can be used.

### Motivation and Context
Allows GQA to work with batched input.

---------

Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
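For readers unfamiliar with the operation this kernel implements, the sketch below illustrates grouped-query attention (GQA) with a per-sequence padding mask in plain NumPy. This is not the Flash Attention kernel itself (which fuses these steps and never materializes the score matrix); it is an illustrative reference of the math, and the function name and `seqlens` representation of the mask are assumptions for the example.

```python
import numpy as np

def gqa_with_mask(q, k, v, seqlens):
    """Illustrative grouped-query attention with a padding mask (not the fused kernel).

    q:       (batch, num_q_heads, seq, head_dim)
    k, v:    (batch, num_kv_heads, seq, head_dim), num_q_heads % num_kv_heads == 0
    seqlens: (batch,) valid token counts per sequence -- the "attention mask" input
    """
    b, hq, s, d = q.shape
    hkv = k.shape[1]
    group = hq // hkv
    # GQA: each group of query heads shares one KV head.
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d)   # (b, hq, s, s)
    # Mask padded key positions so batched sequences of different
    # lengths cannot attend to padding tokens.
    key_pos = np.arange(s)
    pad = key_pos[None, :] >= seqlens[:, None]          # (b, s), True = padding
    scores = np.where(pad[:, None, None, :], -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Masked key positions receive zero attention weight, so perturbing the padded tail of one sequence in the batch leaves the outputs at its valid positions unchanged.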