onnxruntime
771a4d49 - [webgpu] Fused GeneratePositionIDs into FusedQKRotaryEmbedding (#26400)

Commit
53 days ago
### Description

This PR fuses GeneratePositionIDs into FusedQKRotaryEmbedding, which eliminates one kernel call.

### Motivation and Context

Previously, for GQA, the processing flow was:

`SplitPackedQKVProgram -> GeneratePositionIDs -> FusedQKRotaryEmbedding -> FlashAttention`

After this change, the pipeline becomes:

`SplitPackedQKVProgram -> FusedQKRotaryEmbedding -> FlashAttention`

On NV5080, token generation speed improved by ~4% (128 tps -> 133 tps).
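The idea behind the fusion can be illustrated with a minimal CPU-side sketch. It is not the WebGPU shader from this PR. The function names, the half-split rotation layout, the `base` value, and the rule `position = past_len + token_index` are all assumptions made for illustration. The sketch only shows that computing the position inline gives the same result as materializing a position-ID buffer in a separate pass.

```python
import math


def generate_position_ids(past_len, seq_len):
    # Hypothetical stand-in for a separate position-ID pass:
    # assumes position = past_len + token index.
    return [past_len + i for i in range(seq_len)]


def rotate(vec, pos, base=10000.0):
    # Rotary embedding on one head vector, pairing element i with
    # element i + half (half-split layout, assumed for this sketch).
    dim = len(vec)
    half = dim // 2
    out = list(vec)
    for i in range(half):
        angle = pos * base ** (-2.0 * i / dim)
        c, s = math.cos(angle), math.sin(angle)
        out[i] = vec[i] * c - vec[i + half] * s
        out[i + half] = vec[i] * s + vec[i + half] * c
    return out


def rotary_unfused(q, k, past_len):
    # Two steps: build the position-ID buffer, then rotate Q and K.
    pos_ids = generate_position_ids(past_len, len(q))
    q_out = [rotate(v, p) for v, p in zip(q, pos_ids)]
    k_out = [rotate(v, p) for v, p in zip(k, pos_ids)]
    return q_out, k_out


def rotary_fused(q, k, past_len):
    # One step: the position is computed inline per token,
    # so no intermediate position-ID buffer is needed.
    q_out, k_out = [], []
    for i in range(len(q)):
        pos = past_len + i
        q_out.append(rotate(q[i], pos))
        k_out.append(rotate(k[i], pos))
    return q_out, k_out
```

Both functions take `q` and `k` as lists of per-token head vectors and return the rotated pair. The fused variant removes the intermediate buffer, which in a GPU setting corresponds to one fewer kernel dispatch.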