[webgpu] Fused SplitPackedQKV with FusedQKRotaryEmbedding (#26447)
### Description
When both `is_packed_qkv_` and `do_rotary_` are set, call a new
`SplitPackedQKVWithRotaryEmbedding` program that fuses SplitPackedQKV with
FusedQKRotaryEmbedding.
The dispatch size is `B * S * N * work_per_head`, where `work_per_head` is
`head_size - half_rotary_embedding_dim` (equivalently,
`half_rotary_embedding_dim + need_copy_dim`); a sketch of the per-invocation
logic follows the list below.
- For the first `half_rotary_embedding_dim` elements, we split the packed QKV,
apply the rotary embedding to the q/k pairs, and store v directly.
- For the remaining `need_copy_dim` elements, we split the packed QKV and store
q/k/v directly.
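
To make the per-invocation work concrete, here is a minimal host-side C++ sketch of the logic described above. The packed layout (`[B, S, N, 3, head_size]`), the non-interleaved rotary pairing, the cos/sin cache shapes, and the function/parameter names are assumptions made for illustration only; they are not taken from the actual WGSL shader in this PR.

```cpp
// Host-side sketch only, not the WGSL shader. Assumed layouts: packed input
// [B, S, N, 3, head_size]; outputs Q/K/V each [B, S, N, head_size];
// non-interleaved rotary pairing (x[i], x[i + half_rotary_dim]);
// cos/sin caches of shape [S, half_rotary_dim].
#include <cstddef>
#include <cstdio>
#include <vector>

void SplitPackedQKVWithRotarySketch(const std::vector<float>& packed_qkv,
                                    const std::vector<float>& cos_cache,
                                    const std::vector<float>& sin_cache,
                                    std::vector<float>& q, std::vector<float>& k,
                                    std::vector<float>& v, int B, int S, int N,
                                    int head_size, int rotary_dim) {
  const int half_rotary_dim = rotary_dim / 2;
  const int need_copy_dim = head_size - rotary_dim;
  const int work_per_head = half_rotary_dim + need_copy_dim;  // == head_size - half_rotary_dim

  // Each iteration of the innermost loop corresponds to one shader invocation;
  // the dispatch size is B * S * N * work_per_head.
  for (int b = 0; b < B; ++b)
    for (int s = 0; s < S; ++s)
      for (int n = 0; n < N; ++n)
        for (int w = 0; w < work_per_head; ++w) {
          const std::size_t in_base = (((std::size_t)b * S + s) * N + n) * 3 * head_size;
          const std::size_t out_base = (((std::size_t)b * S + s) * N + n) * head_size;
          if (w < half_rotary_dim) {
            // Rotary region: rotate the (w, w + half_rotary_dim) pair for q and k,
            // and copy the same two positions of v unchanged.
            const float c = cos_cache[(std::size_t)s * half_rotary_dim + w];
            const float sn = sin_cache[(std::size_t)s * half_rotary_dim + w];
            for (int which = 0; which < 2; ++which) {  // 0 = q, 1 = k
              const float x0 = packed_qkv[in_base + which * head_size + w];
              const float x1 = packed_qkv[in_base + which * head_size + w + half_rotary_dim];
              std::vector<float>& out = (which == 0) ? q : k;
              out[out_base + w] = x0 * c - x1 * sn;
              out[out_base + w + half_rotary_dim] = x0 * sn + x1 * c;
            }
            v[out_base + w] = packed_qkv[in_base + 2 * head_size + w];
            v[out_base + w + half_rotary_dim] =
                packed_qkv[in_base + 2 * head_size + w + half_rotary_dim];
          } else {
            // Copy region: the trailing need_copy_dim elements of q/k/v are stored as-is.
            const int d = rotary_dim + (w - half_rotary_dim);
            q[out_base + d] = packed_qkv[in_base + 0 * head_size + d];
            k[out_base + d] = packed_qkv[in_base + 1 * head_size + d];
            v[out_base + d] = packed_qkv[in_base + 2 * head_size + d];
          }
        }
}

int main() {
  const int B = 1, S = 2, N = 1, head_size = 8, rotary_dim = 4;
  std::vector<float> packed((std::size_t)B * S * N * 3 * head_size, 1.0f);
  std::vector<float> cos_cache((std::size_t)S * rotary_dim / 2, 1.0f);  // cos = 1, sin = 0:
  std::vector<float> sin_cache((std::size_t)S * rotary_dim / 2, 0.0f);  // rotary acts as identity
  std::vector<float> q((std::size_t)B * S * N * head_size), k(q), v(q);
  SplitPackedQKVWithRotarySketch(packed, cos_cache, sin_cache, q, k, v,
                                 B, S, N, head_size, rotary_dim);
  std::printf("q[0]=%.1f k[0]=%.1f v[0]=%.1f\n", q[0], k[0], v[0]);  // all 1.0
  return 0;
}
```

Each invocation in the rotary region writes one q/k pair plus the matching two v elements, while each invocation in the copy region writes one element of q, k, and v, so the whole head is covered with `head_size - half_rotary_embedding_dim` invocations.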
### Motivation and Context
On NV5080, token generation speed improves by ~3%.
| Generation speed (tokens/s) | Before | After |
|--------|--------|-------|
| NV5080 | 129 | **133** |
| Intel | 15.4 | 15.5 |
| Mac | 69.0 | 71.0 |