onnxruntime
cbb29d80 - GQA Rotary and Packed QKV with Flash (#18906)

Commit

2 years ago

GQA Rotary and Packed QKV with Flash (#18906)

### Description

These changes add rotary embedding and packed QKV input to GQA. For now, both are supported only with Flash-Attention (SM >= 80), but support with Memory Efficient Attention should follow soon.

### Motivation and Context

With the fusion of rotary embedding into this Attention op, we hope to observe some perf gain. Packed QKV should also provide some perf gain for certain models, like Llama2, that benefit from running ops on the fused QKV matrix rather than on separate Q, K, and V.

---------

Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
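For readers unfamiliar with the two features named in the description, here is a minimal NumPy sketch of what "packed QKV" and "rotary embedding" mean in a grouped-query setting. It is not the code from this commit and not the kernel's actual layout: the function names `split_packed_qkv` and `rotary_embedding`, the Q-then-K-then-V ordering along the last axis, the half-rotation (non-interleaved) variant, and the base of 10000 are all assumptions chosen for illustration.

```python
import numpy as np


def split_packed_qkv(packed, num_heads, kv_num_heads, head_size):
    """Split a packed QKV tensor into separate Q, K, and V tensors.

    Assumed (hypothetical) layout: `packed` has shape
    (batch, seq_len, (num_heads + 2 * kv_num_heads) * head_size),
    with Q first, then K, then V along the last axis.
    """
    batch, seq_len, hidden = packed.shape
    q_size = num_heads * head_size
    kv_size = kv_num_heads * head_size
    assert hidden == q_size + 2 * kv_size, "unexpected packed hidden size"

    # Cut the last axis at the Q/K and K/V boundaries.
    q, k, v = np.split(packed, [q_size, q_size + kv_size], axis=-1)
    q = q.reshape(batch, seq_len, num_heads, head_size)
    k = k.reshape(batch, seq_len, kv_num_heads, head_size)
    v = v.reshape(batch, seq_len, kv_num_heads, head_size)
    return q, k, v


def rotary_embedding(x, base=10000.0):
    """Apply rotary position embedding to x of shape
    (batch, seq_len, heads, head_size), using the half-rotation variant:
    element i is paired with element i + head_size // 2.
    """
    _, seq_len, _, head_size = x.shape
    assert head_size % 2 == 0, "head_size must be even"
    half = head_size // 2

    # One rotation frequency per pair of elements.
    inv_freq = base ** (-np.arange(half) / half)               # (half,)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]   # (seq_len, half)
    cos = np.cos(angles)[None, :, None, :]
    sin = np.sin(angles)[None, :, None, :]

    # Rotate each (x1, x2) pair by the position-dependent angle.
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

In this sketch a model with 4 query heads, 2 key/value heads, and a head size of 8 would pass a packed tensor whose last dimension is (4 + 2 * 2) * 8 = 64. Rotary embedding would then be applied to Q and K but not V. Fusing both steps into the attention op avoids materializing the intermediate tensors that this unfused version creates.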

References

#18906 - GQA Rotary and Packed QKV with Flash

Author

aciddelgado

Parents

532f8c64
