Packed QKV and Rotary Embedding Support for sm<80 GQA (#20012)
### Description
Add support for packed QKV input and rotary embedding on sm<80 GPUs
using the memory-efficient attention kernel.
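
For reviewers unfamiliar with the two features, the following is a minimal reference sketch in plain Python, not the CUDA kernel from this PR. It assumes the packed input stores, per token, all query heads followed by all key heads and then all value heads, and it assumes the half-rotation (non-interleaved) rotary variant with a base of 10000. The function names `split_packed_qkv` and `rotate_head` are hypothetical and exist only for illustration.

```python
import math


def split_packed_qkv(row, num_heads, kv_num_heads, head_size):
    """Split one token's packed QKV row into per-head Q, K, V lists.

    Assumed layout: [Q heads | K heads | V heads], so the row length is
    (num_heads + 2 * kv_num_heads) * head_size.
    """
    q_size = num_heads * head_size
    kv_size = kv_num_heads * head_size
    if len(row) != q_size + 2 * kv_size:
        raise ValueError("row length does not match the assumed packed layout")

    def heads(flat):
        # Chop a flat slice into consecutive head_size-long vectors.
        return [flat[i:i + head_size] for i in range(0, len(flat), head_size)]

    q = heads(row[:q_size])
    k = heads(row[q_size:q_size + kv_size])
    v = heads(row[q_size + kv_size:])
    return q, k, v


def rotate_head(vec, position, base=10000.0):
    """Apply half-rotation rotary embedding to one head vector.

    Element i is paired with element i + half and the pair is rotated by
    position * base ** (-i / half).
    """
    half = len(vec) // 2
    out = list(vec)
    for i in range(half):
        angle = position * base ** (-i / half)
        c, s = math.cos(angle), math.sin(angle)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x1 * s + x2 * c
    return out


# Example: 4 query heads, 2 key/value heads, head size 4.
row = [float(i) for i in range((4 + 2 * 2) * 4)]
q, k, v = split_packed_qkv(row, num_heads=4, kv_num_heads=2, head_size=4)
q_rot = [rotate_head(h, position=3) for h in q]
k_rot = [rotate_head(h, position=3) for h in k]
```

Rotary embedding is applied to the query and key heads only; the value heads pass through unchanged.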
### Motivation and Context
Allows lower-end GPUs (sm<80, i.e. compute capability below 8.0) to run
GQA with packed QKV input and rotary embedding.