[webgpu] Optimize InstanceNormalization by removing redundant transpose (#26626)
### Description
This PR optimizes `InstanceNormalization` by removing a redundant
transpose.
Since the `NCHW` implementation of `InstanceNormalization` is more
efficient, there is no need to wrap it in `Transpose` nodes to run it in
`NHWC`; eliding the redundant transposes improves performance.
Testing on Lunar Lake shows about a `60%` performance improvement for
`InstanceNormalization` operations.
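As a hedged sketch of why the wrapper transposes are redundant (illustrative NumPy only, not the actual WebGPU kernel; function names are made up for this example): instance normalization reduces over the spatial dimensions per `(batch, channel)` pair, so normalizing an `NHWC` tensor directly over axes `(1, 2)` yields the same result as transposing to `NCHW`, normalizing, and transposing back.

```python
import numpy as np

def instance_norm_nchw(x, scale, bias, eps=1e-5):
    # Normalize each (n, c) slice over the spatial dims H and W.
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    y = (x - mean) / np.sqrt(var + eps)
    return scale[None, :, None, None] * y + bias[None, :, None, None]

def instance_norm_nhwc_via_transpose(x_nhwc, scale, bias, eps=1e-5):
    # The wrapper approach: transpose NHWC -> NCHW, normalize, transpose back.
    y = instance_norm_nchw(x_nhwc.transpose(0, 3, 1, 2), scale, bias, eps)
    return y.transpose(0, 2, 3, 1)

def instance_norm_nhwc_direct(x_nhwc, scale, bias, eps=1e-5):
    # Direct computation: reduce over the spatial axes (1, 2); since
    # channels are last, scale and bias broadcast without reshaping.
    mean = x_nhwc.mean(axis=(1, 2), keepdims=True)
    var = x_nhwc.var(axis=(1, 2), keepdims=True)
    return scale * (x_nhwc - mean) / np.sqrt(var + eps) + bias

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 8, 4)).astype(np.float32)
scale = rng.standard_normal(4).astype(np.float32)
bias = rng.standard_normal(4).astype(np.float32)

# Both paths agree, so the transposes can be elided.
assert np.allclose(instance_norm_nhwc_via_transpose(x, scale, bias),
                   instance_norm_nhwc_direct(x, scale, bias), atol=1e-5)
```

The optimization in this PR goes one step further: because the `NCHW` kernel is the faster one, the graph runs it directly instead of sandwiching it between transposes.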
#### `InstanceNormalization` OP benchmark
The input tensor shape: `(1,32,1048576)`
The scale tensor shape: `(32)`
The B tensor shape: `(32)`
| device | baseline (ms) | opt (ms) | diff |
| ---------------- | -------- | ---- | ---- |
| Lunar Lake | 82.6 | 34.2 | -58% |
#### Model benchmark
| model | baseline (ms) | opt (ms) | diff |
| ---------------- | -------- | ---- | ---- |
| sd-turbo-vae-decoder-fp16-demo | 2437.6 | 1835.9 | -25% |
### Motivation and Context
Please see the description above.