GQA Memory Efficient Kernel (#17920)

Commit

2 years ago

GQA Memory Efficient Kernel (#17920)

Implement Cutlass Memory Efficient Attention Kernel into Group Query Attention Operator.

### Motivation and Context

Before this change, Group Query Attention Operator was supported only by Flash-Attention. While this is the most efficient kernel for the operation, it only supports sm >= 80. Cutlass Memory Efficient Attention Kernel supports sm >= 53, allowing us to support a broader range of GPU hardware.
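The hardware reasoning above can be illustrated with a minimal sketch of a capability-based fallback. This is not the operator's actual dispatch code: the function name `select_gqa_kernel`, the constant names, and the returned labels are hypothetical, and only the two thresholds (sm >= 80 and sm >= 53) come from the commit message. A real dispatcher may check further conditions that the commit message does not describe.

```python
# Hypothetical sketch: pick an attention kernel from the GPU compute capability.
# Only the two thresholds are taken from the commit message; everything else
# (names, return values) is an assumption made for illustration.

FLASH_ATTENTION_MIN_SM = 80    # Flash-Attention supports sm >= 80
MEMORY_EFFICIENT_MIN_SM = 53   # Cutlass memory efficient attention supports sm >= 53


def select_gqa_kernel(sm: int) -> str:
    """Return a label for the preferred kernel given an sm version such as 75 or 80."""
    if sm >= FLASH_ATTENTION_MIN_SM:
        # Prefer Flash-Attention where available, since the commit message
        # describes it as the most efficient kernel for the operation.
        return "flash_attention"
    if sm >= MEMORY_EFFICIENT_MIN_SM:
        # Fall back to the memory efficient kernel on older hardware.
        return "memory_efficient_attention"
    return "unsupported"
```

For example, under these assumptions an sm 75 device would get the memory efficient kernel, while an sm 80 device would keep using Flash-Attention.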

References

#17920 - GQA Memory Efficient Kernel

Author

aciddelgado

Parents

a2e9ba72

onnxruntime 178f7caa - GQA Memory Efficient Kernel (#17920)
