Add GQA on CPU in LLaMA scripts (#20720)
### Description
This PR adds support for using GroupQueryAttention (GQA) in models that
run on CPU.
### Motivation and Context
Previously, the LLaMA scripts only supported creating models with GQA
for CUDA. With the recently added support for [GQA on
CPU](https://github.com/microsoft/onnxruntime/pull/20299), models where
`num_attention_heads != num_key_value_heads` can now use the GQA op and
[run much faster on
CPU](https://github.com/microsoft/onnxruntime/pull/20598).
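For reference, a minimal sketch of the kind of node the scripts emit: a
`GroupQueryAttention` node from the `com.microsoft` domain, where distinct
`num_heads` and `kv_num_heads` attributes express the grouped-query layout.
The input/output names and head counts below are illustrative assumptions,
not the PR's actual code:

```python
from onnx import helper

# Assumed example values: 32 attention heads sharing 8 key/value heads,
# i.e. num_attention_heads != num_key_value_heads in the HF config.
num_heads = 32
kv_num_heads = 8

# Hypothetical GQA node; input/output names are placeholders.
gqa_node = helper.make_node(
    "GroupQueryAttention",
    inputs=[
        "query", "key", "value",
        "past_key", "past_value",
        "seqlens_k", "total_sequence_length",
    ],
    outputs=["attn_output", "present_key", "present_value"],
    name="GroupQueryAttention_0",
    domain="com.microsoft",
    num_heads=num_heads,
    kv_num_heads=kv_num_heads,
)
```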