onnxruntime
178f7caa - GQA Memory Efficient Kernel (#17920)

Commit
2 years ago
Implement the Cutlass Memory Efficient Attention kernel in the Group Query Attention operator.

### Motivation and Context

Before this change, the Group Query Attention operator was supported only by the Flash-Attention kernel. While that is the most efficient kernel for the operation, it only supports sm >= 80. The Cutlass Memory Efficient Attention kernel supports sm >= 53, allowing us to support a broader range of GPU hardware.
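The hardware thresholds described in this commit message can be read as a simple fallback rule. The sketch below is illustrative only and is not the dispatch code from this commit: `select_gqa_kernel` and the returned strings are hypothetical names, and `sm` is assumed to be the CUDA compute capability encoded as `major * 10 + minor` (for example, 86 for compute capability 8.6).

```python
def select_gqa_kernel(sm: int) -> str:
    """Pick an attention kernel for Group Query Attention from the SM version.

    Hypothetical helper: the thresholds come from the commit message, while
    the function name and return values are assumptions for illustration.
    """
    if sm >= 80:
        # Flash-Attention is described as the most efficient kernel,
        # but it only supports sm >= 80.
        return "flash_attention"
    if sm >= 53:
        # The Cutlass Memory Efficient Attention kernel covers sm >= 53.
        return "memory_efficient_attention"
    # Below sm 53, neither kernel named in the commit message applies.
    return "unsupported"


if __name__ == "__main__":
    for sm in (86, 75, 50):
        print(sm, select_gqa_kernel(sm))
```

Under these assumptions, a device with sm 75 would fall back to the memory efficient kernel rather than being rejected, which is the broader hardware coverage the commit message refers to.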