RoPE embedding kernel to use AVX2 (#23694)
### Description
Credit to [chethanpk](https://github.com/chethanpk), who provided the RoPE embedding implementation in a patch; that patch is the first commit of this PR.

I have confirmed the performance improvement from this code change. My analysis is based on phi-3-mini-4k-instruct-int4-int8-blklen32. The end-to-end benchmark from onnxruntime-genai does not show a clear improvement because GQA accounts for only a small portion of the whole model (<10%), and RoPE accounts for only a small portion of GQA (12%). The profiles below, with and without AVX2, show the cost of RoPE dropping from 82.42 to 18.86, so I still recommend merging this PR.
Kernel|Name|Mean Duration|Percentage
-|-|-|-
AVX2 RoPE|GroupQueryAttention_rotary|18.86|3.16%
plain C++ RoPE|GroupQueryAttention_rotary|82.42|12.20%
MLAS benchmark (baseline vs. new, lower is better):
dim|interleaved|baseline|new
-|-|-|-
128|false|735|18.1
256|false|1470|31.7
512|false|2938|59.2
1024|false|5876|81.5
128|true|368|23.1
256|true|735|34.3
512|true|1470|62.0
1024|true|2937|125
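For context on the `interleaved` column: rotary embedding applies a 2-D rotation to pairs of elements of each head vector, and the two layouts differ only in how the pairs are formed (adjacent elements vs. split halves). Below is a minimal scalar sketch of that math, not the MLAS kernel itself; the function name and argument shapes are illustrative:

```python
def rope_ref(x, cos, sin, interleaved=False):
    """Scalar reference for rotary position embedding on one head vector.

    x: list of d floats (d even); cos/sin: lists of d//2 floats for the
    current position. Returns the rotated vector.
    """
    d = len(x)
    half = d // 2
    y = [0.0] * d
    if interleaved:
        # Interleaved layout: pairs are adjacent elements (x[2i], x[2i+1]).
        for i in range(half):
            a, b = x[2 * i], x[2 * i + 1]
            y[2 * i] = a * cos[i] - b * sin[i]
            y[2 * i + 1] = a * sin[i] + b * cos[i]
    else:
        # Non-interleaved layout: pairs are split halves (x[i], x[i + half]).
        for i in range(half):
            a, b = x[i], x[i + half]
            y[i] = a * cos[i] - b * sin[i]
            y[i + half] = a * sin[i] + b * cos[i]
    return y
```

The AVX2 kernel vectorizes these per-pair rotations; the non-interleaved layout maps more directly onto contiguous SIMD loads, which is consistent with the larger speedups in the `false` rows above.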
---------
Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>