onnxruntime
771a4d49 - [webgpu] Fused GeneratePositionIDs into FusedQKRotaryEmbedding (#26400)

Commit

53 days ago

[webgpu] Fused GeneratePositionIDs into FusedQKRotaryEmbedding (#26400)

### Description

This PR fuses GeneratePositionIDs into FusedQKRotaryEmbedding, which removes one kernel call.

### Motivation and Context

Previously, for GQA, the processing flow was:

`SplitPackedQKVProgram -> GeneratePositionIDs -> FusedQKRotaryEmbedding -> FlashAttention`

After this change, the pipeline becomes:

`SplitPackedQKVProgram -> FusedQKRotaryEmbedding -> FlashAttention`

On NV5080, token generation speed improved by ~4% (128 tps -> 133 tps).
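The idea behind the two flows above can be illustrated with a minimal CPU-side sketch in Python. It is not the WebGPU shader from this PR. The function names, the single-head `[batch][seq][head_dim]` layout, the interleaved pair rotation, and the rule `position = past_len + s` are all assumptions made for illustration. The unfused variant materializes a position-ID buffer in a separate pass, while the fused variant derives each position inline while rotating.

```python
import math


def rotate_token(vec, pos, base=10000.0):
    """Apply a rotary embedding to one token vector at position `pos`.

    Assumes adjacent elements (2i, 2i+1) form a rotation pair.
    """
    head_dim = len(vec)
    out = list(vec)
    for i in range(head_dim // 2):
        theta = pos / (base ** (2.0 * i / head_dim))
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x0 * c - x1 * s
        out[2 * i + 1] = x0 * s + x1 * c
    return out


def generate_position_ids(past_lens, seq_len):
    """Unfused pass 1: write one position ID per (batch, token) to a buffer.

    Hypothetical rule: position = past length of the batch entry + token index.
    """
    return [[past + s for s in range(seq_len)] for past in past_lens]


def rotary_with_ids(x, position_ids):
    """Unfused pass 2: read each position from the precomputed buffer."""
    return [
        [rotate_token(tok, position_ids[b][s]) for s, tok in enumerate(seq)]
        for b, seq in enumerate(x)
    ]


def rotary_fused(x, past_lens):
    """Fused single pass: compute the position inline, no intermediate buffer."""
    return [
        [rotate_token(tok, past_lens[b] + s) for s, tok in enumerate(seq)]
        for b, seq in enumerate(x)
    ]


if __name__ == "__main__":
    x = [
        [[1.0, 0.0, 0.5, -0.5], [0.2, 0.3, 0.4, 0.1]],
        [[0.7, -0.1, 0.0, 0.9], [0.3, 0.3, -0.2, 0.6]],
    ]
    past_lens = [0, 3]
    unfused = rotary_with_ids(x, generate_position_ids(past_lens, seq_len=2))
    fused = rotary_fused(x, past_lens)
    print(unfused == fused)
```

Both variants perform the same arithmetic per token, so they produce the same result. The fused form simply skips the intermediate buffer and the extra pass, which on a GPU corresponds to one fewer kernel dispatch.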

References

#26400 - [webgpu] Fused GeneratePositionIDs into FusedQKRotaryEmbedding

Author

xiaofeihan1

Parents

954bb7bc
