[webgpu] Support RotaryEmbedding in flash attention (#26297)
### Description
We enabled Rotary Embedding (ROE) support in the Flash Attention path and
introduced a fused operator, `FusedQKRotaryEmbeddingProgram`, which
combines the two ROE operations (for Q and K) into a single kernel.
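Conceptually, the fusion applies the same rotary rotation to the Q and K head vectors in one pass instead of two separate dispatches. The C++ sketch below is only an illustration of that idea; the function names, the half-split (non-interleaved) layout, and the cos/sin cache shapes are assumptions, not the actual WGSL kernel in this PR.

```cpp
// Illustrative sketch only: names, layout, and cache shapes are assumptions,
// not the actual FusedQKRotaryEmbeddingProgram WGSL kernel.
#include <cstddef>
#include <vector>

// Rotate one head vector in place using a half-split (non-interleaved) layout:
// the pair (x[i], x[i + half_dim]) is rotated by the cached angle for
// dimension i at the current token position.
void ApplyRotary(float* x, const float* cos_cache, const float* sin_cache,
                 std::size_t half_dim) {
  for (std::size_t i = 0; i < half_dim; ++i) {
    const float x0 = x[i];
    const float x1 = x[i + half_dim];
    x[i] = x0 * cos_cache[i] - x1 * sin_cache[i];
    x[i + half_dim] = x0 * sin_cache[i] + x1 * cos_cache[i];
  }
}

// Fused variant: Q and K for the same token are rotated in a single pass,
// which is what lets the attention path replace two standalone ROE dispatches
// with one kernel launch.
void FusedQKRotary(std::vector<float>& q, std::vector<float>& k,
                   const std::vector<float>& cos_cache,
                   const std::vector<float>& sin_cache, std::size_t head_dim) {
  const std::size_t half = head_dim / 2;
  ApplyRotary(q.data(), cos_cache.data(), sin_cache.data(), half);
  ApplyRotary(k.data(), cos_cache.data(), sin_cache.data(), half);
}
```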
### Motivation and Context
Previously, Flash Attention for WebGPU did not support ROE, while the
CUDA version did. As a result, the Phi-4 model required different
operators depending on the execution provider. With this PR, both CUDA
and WebGPU can share the same Phi-4 model implementation.
Fusing the ROE operations within the attention module helps reduce
CPU-side overhead by:
1. Decreasing the number of `MatMulNBits` operations.
2. Reducing `MemcpyFromHost` calls.
This optimization improves **token generation speed** on
high-performance GPUs, **achieving over 5% speedup on an NVIDIA 5080 and
4% on an Apple M3 Max.**
We believe the generation phase is CPU-bound on such high-end GPUs, so
reducing CPU time leads to noticeable gains.
However, the improvement is negligible on low-performance GPUs.
<img width="1589" height="870" alt="image"
src="https://github.com/user-attachments/assets/ae8a73b3-3e41-4ae6-9687-43058c2be140"
/>