[webgpu] Fused SplitPackedQKV with FusedQKRotaryEmbedding (#26447)
### Description
When both `is_packed_qkv_` and `do_rotary_` are set, call a new
`SplitPackedQKVWithRotaryEmbedding` program that fuses SplitPackedQKV with
FusedQKRotaryEmbedding.
The dispatch size is `B * S * N * work_per_head`, where `work_per_head` is
`head_size - half_rotary_embedding_dim` (equivalently,
`half_rotary_embedding_dim + need_copy_dim`); a sketch of the per-invocation
logic follows the list below.
- For the first `half_rotary_embedding_dim` elements, we split the packed QKV,
apply the rotary embedding to the q/k pairs, and store v directly.
- For the remaining `need_copy_dim` elements, we split the packed QKV and store
q/k/v directly.
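
To make the per-invocation work concrete, here is a minimal host-side C++ sketch of the logic described above. The packed layout (`[B, S, N, 3, head_size]`), the non-interleaved rotary pairing, the cos/sin cache shapes, and the function/parameter names are assumptions made for illustration only; they are not taken from the actual WGSL shader in this PR.

```cpp
// Host-side sketch only, not the WGSL shader. Assumed layouts: packed input
// [B, S, N, 3, head_size]; outputs Q/K/V each [B, S, N, head_size];
// non-interleaved rotary pairing (x[i], x[i + half_rotary_dim]);
// cos/sin caches of shape [S, half_rotary_dim].
#include <cstddef>
#include <cstdio>
#include <vector>

void SplitPackedQKVWithRotarySketch(const std::vector<float>& packed_qkv,
                                    const std::vector<float>& cos_cache,
                                    const std::vector<float>& sin_cache,
                                    std::vector<float>& q, std::vector<float>& k,
                                    std::vector<float>& v, int B, int S, int N,
                                    int head_size, int rotary_dim) {
  const int half_rotary_dim = rotary_dim / 2;
  const int need_copy_dim = head_size - rotary_dim;
  const int work_per_head = half_rotary_dim + need_copy_dim;  // == head_size - half_rotary_dim

  // Each iteration of the innermost loop corresponds to one shader invocation;
  // the dispatch size is B * S * N * work_per_head.
  for (int b = 0; b < B; ++b)
    for (int s = 0; s < S; ++s)
      for (int n = 0; n < N; ++n)
        for (int w = 0; w < work_per_head; ++w) {
          const std::size_t in_base = (((std::size_t)b * S + s) * N + n) * 3 * head_size;
          const std::size_t out_base = (((std::size_t)b * S + s) * N + n) * head_size;
          if (w < half_rotary_dim) {
            // Rotary region: rotate the (w, w + half_rotary_dim) pair for q and k,
            // and copy the same two positions of v unchanged.
            const float c = cos_cache[(std::size_t)s * half_rotary_dim + w];
            const float sn = sin_cache[(std::size_t)s * half_rotary_dim + w];
            for (int which = 0; which < 2; ++which) {  // 0 = q, 1 = k
              const float x0 = packed_qkv[in_base + which * head_size + w];
              const float x1 = packed_qkv[in_base + which * head_size + w + half_rotary_dim];
              std::vector<float>& out = (which == 0) ? q : k;
              out[out_base + w] = x0 * c - x1 * sn;
              out[out_base + w + half_rotary_dim] = x0 * sn + x1 * c;
            }
            v[out_base + w] = packed_qkv[in_base + 2 * head_size + w];
            v[out_base + w + half_rotary_dim] =
                packed_qkv[in_base + 2 * head_size + w + half_rotary_dim];
          } else {
            // Copy region: the trailing need_copy_dim elements of q/k/v are stored as-is.
            const int d = rotary_dim + (w - half_rotary_dim);
            q[out_base + d] = packed_qkv[in_base + 0 * head_size + d];
            k[out_base + d] = packed_qkv[in_base + 1 * head_size + d];
            v[out_base + d] = packed_qkv[in_base + 2 * head_size + d];
          }
        }
}

int main() {
  const int B = 1, S = 2, N = 1, head_size = 8, rotary_dim = 4;
  std::vector<float> packed((std::size_t)B * S * N * 3 * head_size, 1.0f);
  std::vector<float> cos_cache((std::size_t)S * rotary_dim / 2, 1.0f);  // cos = 1, sin = 0:
  std::vector<float> sin_cache((std::size_t)S * rotary_dim / 2, 0.0f);  // rotary acts as identity
  std::vector<float> q((std::size_t)B * S * N * head_size), k(q), v(q);
  SplitPackedQKVWithRotarySketch(packed, cos_cache, sin_cache, q, k, v,
                                 B, S, N, head_size, rotary_dim);
  std::printf("q[0]=%.1f k[0]=%.1f v[0]=%.1f\n", q[0], k[0], v[0]);  // all 1.0
  return 0;
}
```

Each invocation in the rotary region writes one q/k pair plus the matching two v elements, while each invocation in the copy region writes one element of q, k, and v, so the whole head is covered with `head_size - half_rotary_embedding_dim` invocations.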
### Motivation and Context
On NV5080, token generation speed improves by ~3%.
| Generation speed (tokens/s) | Before | After |
|--------|--------|-------|
| NV5080 | 129 | **133** |
| Intel | 15.4 | 15.5 |
| Mac | 69.0 | 71.0 |