onnxruntime
27a03105 - Add support for custom position ids and attention bias to GQA CPU operator (#23944)

Commit
283 days ago
Add support for custom position ids and attention bias to GQA CPU operator (#23944) ### Description - Added support for custom position ids and attention masks to the GQA CPU operator (fp32 and fp16) - Added MLAS eltwise add kernel for mask application for FP32 and FP16 - Added unit tests for the added eltwise add MLAS kernel - Modified python tests to test the new GQA inputs ### Motivation and Context Custom position ids and attention mask are required in order to implement speculative decoding in PhiSilica ### Benchmarks All the benchmarks are executed on the GQA op configuration which will be used in the PhiSilica speculative decoding secnario, and the configuration is as follows: - num_heads: 32 - kv_num_heads: 32 - do_rotary: 1 - local_window_size: -1 - head_size: 96 - sequence_length: 6 - packed_qkv: True Benchmarks were executed on Cadmus with Snapdragon(R) X 12-core X1E80100 @ 3.40 GHz In the tables below, column headers are total sequence length values used for benchmarking, and the row values are if the attention bias was used or not. Values are average inference time in ms over 100000 runs. #### Fp16 results | Total sequence length | 50 | 100 | 250 | 500 | 750 | 1000 | 1500 | 2000 | 2500 | 3000 | 3500 | 4000 | |:-----------------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:--------|:--------|:--------|:--------|:--------| | Without bias | 0.284054 | 0.257449 | 0.275806 | 0.334123 | 0.458324 | 0.614133 | 0.912791 | 1.38585 | 1.92186 | 2.39203 | 2.88808 | 3.46262 | | With bias | 0.250926 | 0.253072 | 0.279724 | 0.337774 | 0.499058 | 0.585388 | 0.914316 | 1.40701 | 1.87311 | 2.47475 | 3.3906 | 3.47474 | | Runtime increase | -11.66% | -1.7% | +1.42% | +1.09% | +8.89% | -4.68% | +0.17% | +1.53% | -2.54% | +3.46% | +17.4% | +0.35% | #### Fp32 results | Total sequence length | 50 | 100 | 250 | 500 | 750 | 1000 | 1500 | 2000 | 2500 | 3000 | 3500 | 4000 | |:-----------------|:---------|:---------|:---------|:---------|:---------|:---------|:--------|:--------|:--------|:--------|:--------|:--------| | Without bias | 0.259049 | 0.270541 | 0.304583 | 0.376708 | 0.554013 | 0.633217 | 1.20696 | 1.65985 | 1.95169 | 2.45807 | 3.05637 | 4.05169 | | With bias | 0.261631 | 0.268002 | 0.300853 | 0.370452 | 0.529865 | 0.735216 | 1.43493 | 1.4385 | 1.99028 | 2.3858 | 2.99425 | 4.80197 | | Runtime increase | +1.0% | -0.94% | -1.22% | -1.66% | -4.36% | +16.11% | +18.89% | -13.34% | +1.98% | -2.94% | -2.03% | +18.52% | --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> (cherry picked from commit 5a694bcbc8bd549f24b7cd4e88cbacb4f42b894a)
Committer
Parents
Loading