[webgpu] Fused CopyKVCache and SplitPackedQKVWithRotaryEmbedding as SplitPackedQKVWithRotaryEmbeddingAndCopyKV (#26563)
### Description
This PR adds a fully fused path, SplitPackedQKVWithRotaryEmbeddingAndCopyKV, which combines SplitPackedQKVWithRotaryEmbedding and CopyKVCache into a single fused operation. The fused path runs when flash attention is used and the static KV cache is enabled.
This PR makes the following changes:
- Adds components (vectorized access) support to the existing SplitPackedQKVWithRotaryEmbedding.
- Fuses SplitPackedQKVWithRotaryEmbedding and CopyKVCache into the new SplitPackedQKVWithRotaryEmbeddingAndCopyKV.
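The Python sketch below is a minimal CPU reference of what the fused step computes. It rests on several assumptions that this PR does not state. The packed QKV row is taken to be laid out as `[Q | K | V]` per token. Rotary embedding is the non-interleaved (rotate-half) variant, with cos/sin computed on the fly from base 10000 rather than read from cos/sin caches. The static cache is indexed as `[kv_head][position][dim]`. The names `rope_half` and `split_rope_copy_kv` are hypothetical; the real implementation is a WebGPU shader and may use different layouts. The sketch shows the idea behind the fusion: rotated K and raw V are written directly into their cache slots at `past_seq_len + s`, so the separate CopyKVCache pass and its intermediate K/V tensors are no longer needed.

```python
import math


def rope_half(vec, pos, base=10000.0):
    """Non-interleaved (rotate-half) rotary embedding for one head vector.

    Assumption: cos/sin are computed on the fly here; a real kernel would
    typically read them from precomputed cos/sin caches.
    """
    d = len(vec)
    half = d // 2
    out = list(vec)
    for i in range(half):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x2 * c + x1 * s
    return out


def split_rope_copy_kv(packed, seq_len, num_heads, kv_num_heads, head_size,
                       past_seq_len, key_cache, value_cache):
    """Hypothetical reference for the fused split + RoPE + KV-cache copy.

    packed: seq_len rows, each (num_heads + 2 * kv_num_heads) * head_size long,
            assumed to be laid out as [Q | K | V].
    key_cache / value_cache: [kv_num_heads][max_seq_len][head_size] nested
            lists (a static cache), updated in place.
    Returns Q as [num_heads][seq_len][head_size] with RoPE applied.
    """
    q = [[None] * seq_len for _ in range(num_heads)]
    q_width = num_heads * head_size
    kv_width = kv_num_heads * head_size
    for s in range(seq_len):
        row = packed[s]
        pos = past_seq_len + s  # absolute position of this token
        for h in range(num_heads):
            start = h * head_size
            q[h][s] = rope_half(row[start:start + head_size], pos)
        for h in range(kv_num_heads):
            k_start = q_width + h * head_size
            v_start = q_width + kv_width + h * head_size
            # Fused step: write rotated K and raw V straight into the cache
            # slot instead of producing K/V tensors and copying them later.
            key_cache[h][pos] = rope_half(row[k_start:k_start + head_size], pos)
            value_cache[h][pos] = list(row[v_start:v_start + head_size])
    return q
```

For a single decode step (`seq_len=1`) with `past_seq_len=3`, only cache slot 3 of each KV head is written. Earlier slots are left untouched, which is why this kind of in-place write works with a preallocated static cache.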
### Motivation and Context
On NV5080, token generation throughput improves by about 4% (135 → 141 tokens/s). Intel and Mac show smaller gains.
| Platform | Before (generation tokens/s) | After (generation tokens/s) |
|--------|--------|-------|
| NV5080 | 135 | **141** |
| Intel | 15.3 | 15.4 |
| Mac | 71.2 | 71.8 |
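The relative gains follow from the table with simple arithmetic. The short Python sketch below uses only the numbers listed above and computes (after − before) / before for each platform.

```python
# Generation throughput (tokens/s) copied from the table: (before, after).
results = {
    "NV5080": (135.0, 141.0),
    "Intel": (15.3, 15.4),
    "Mac": (71.2, 71.8),
}


def relative_gain(before, after):
    """Percentage change in throughput going from `before` to `after`."""
    return (after - before) / before * 100.0


gains = {name: relative_gain(b, a) for name, (b, a) in results.items()}
for name, g in gains.items():
    print(f"{name}: {g:+.1f}%")
```

This yields roughly +4.4% for NV5080 and under 1% for Intel and Mac.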