[webgpu] Fused GeneratePositionIDs into FusedQKRotaryEmbedding (#26400)
### Description
This PR fuses GeneratePositionIDs into FusedQKRotaryEmbedding, which
eliminates one kernel launch.
### Motivation and Context
Previously, for GQA, the processing flow was:
`SplitPackedQKVProgram -> GeneratePositionIDs -> FusedQKRotaryEmbedding -> FlashAttention`
After this change, the pipeline becomes:
`SplitPackedQKVProgram -> FusedQKRotaryEmbedding -> FlashAttention`
On NV5080, token generation speed improved by ~4% (128 tps -> 133 tps).
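
For readers unfamiliar with this kind of fusion, the difference between the two pipelines above can be sketched in plain Python. This is an illustrative model only, not the WGSL shader from this PR: the function names (`generate_position_ids`, `rotate`, `rotary_unfused`, `rotary_fused`) are hypothetical, the position of a token is assumed to be simply `past_len + s`, and the rotation uses a half-split layout with angles computed on the fly rather than read from a cos/sin cache.

```python
import math

def generate_position_ids(past_lens, seq_len):
    """Standalone pass: materialize one position id per (batch, token).
    Assumption for this sketch: position = past length + token index."""
    return [[past + s for s in range(seq_len)] for past in past_lens]

def rotate(vec, pos, base=10000.0):
    """Apply a rotary embedding to one head vector at position `pos`
    (half-split layout: element i is paired with element i + half)."""
    half = len(vec) // 2
    out = list(vec)
    for i in range(half):
        angle = pos * base ** (-2.0 * i / len(vec))
        c, s = math.cos(angle), math.sin(angle)
        out[i] = vec[i] * c - vec[i + half] * s
        out[i + half] = vec[i + half] * c + vec[i] * s
    return out

def rotary_unfused(x, past_lens):
    """Two passes: write position ids to a buffer, then read them back."""
    pos_ids = generate_position_ids(past_lens, len(x[0]))  # pass 1
    return [[rotate(x[b][s], pos_ids[b][s])                # pass 2
             for s in range(len(x[b]))]
            for b in range(len(x))]

def rotary_fused(x, past_lens):
    """One pass: derive each position inline, with no intermediate buffer."""
    return [[rotate(x[b][s], past_lens[b] + s)
             for s in range(len(x[b]))]
            for b in range(len(x))]

# x is indexed as [batch][token][head_dim]; values are arbitrary.
x = [[[1.0, 0.0, 0.0, 1.0]], [[0.5, -0.5, 2.0, 1.0]]]
past_lens = [3, 7]
assert rotary_unfused(x, past_lens) == rotary_fused(x, past_lens)
```

Both variants produce the same result; the fused one skips the intermediate position-id buffer and the extra pass that fills it, which is the saving the removed kernel call corresponds to in this model.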