onnxruntime
GQA Flash Attention with Attention Mask (#18283)
Commit 3dece27f · 2 years ago
### Description
GQA now supports an attention mask input via the Flash Attention kernel, allowing batched input with sequences of different lengths. Note: this PR disables Memory Efficient Attention, so only the Flash Attention kernel can be used.

### Motivation and Context
Allows GQA to work with batched input.

---------

Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
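For readers unfamiliar with the operation this kernel implements, the sketch below illustrates grouped-query attention (GQA) with a per-sequence padding mask in plain NumPy. This is not the Flash Attention kernel itself (which fuses these steps and never materializes the score matrix); it is an illustrative reference of the math, and the function name and `seqlens` representation of the mask are assumptions for the example.

```python
import numpy as np

def gqa_with_mask(q, k, v, seqlens):
    """Illustrative grouped-query attention with a padding mask (not the fused kernel).

    q:       (batch, num_q_heads, seq, head_dim)
    k, v:    (batch, num_kv_heads, seq, head_dim), num_q_heads % num_kv_heads == 0
    seqlens: (batch,) valid token counts per sequence -- the "attention mask" input
    """
    b, hq, s, d = q.shape
    hkv = k.shape[1]
    group = hq // hkv
    # GQA: each group of query heads shares one KV head.
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d)   # (b, hq, s, s)
    # Mask padded key positions so batched sequences of different
    # lengths cannot attend to padding tokens.
    key_pos = np.arange(s)
    pad = key_pos[None, :] >= seqlens[:, None]          # (b, s), True = padding
    scores = np.where(pad[:, None, None, :], -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Masked key positions receive zero attention weight, so perturbing the padded tail of one sequence in the batch leaves the outputs at its valid positions unchanged.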