[webgpu] Optimize InstanceNormalization by removing redundant transpose (#26626)
### Description
This PR optimizes `InstanceNormalization` by removing a redundant
transpose.
Since the `NCHW` implementation of `InstanceNormalization` is more
efficient, there is no need to wrap it in `Transpose` nodes to run it in
`NHWC`; eliding the redundant transposes improves performance.
Testing on Lunar Lake shows about a `60%` performance improvement for
`InstanceNormalization` operations.
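As a hedged sketch of why the wrapper transposes are redundant (illustrative NumPy only, not the actual WebGPU kernel; function names are made up for this example): instance normalization reduces over the spatial dimensions per `(batch, channel)` pair, so normalizing an `NHWC` tensor directly over axes `(1, 2)` yields the same result as transposing to `NCHW`, normalizing, and transposing back.

```python
import numpy as np

def instance_norm_nchw(x, scale, bias, eps=1e-5):
    # Normalize each (n, c) slice over the spatial dims H and W.
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    y = (x - mean) / np.sqrt(var + eps)
    return scale[None, :, None, None] * y + bias[None, :, None, None]

def instance_norm_nhwc_via_transpose(x_nhwc, scale, bias, eps=1e-5):
    # The wrapper approach: transpose NHWC -> NCHW, normalize, transpose back.
    y = instance_norm_nchw(x_nhwc.transpose(0, 3, 1, 2), scale, bias, eps)
    return y.transpose(0, 2, 3, 1)

def instance_norm_nhwc_direct(x_nhwc, scale, bias, eps=1e-5):
    # Direct computation: reduce over the spatial axes (1, 2); since
    # channels are last, scale and bias broadcast without reshaping.
    mean = x_nhwc.mean(axis=(1, 2), keepdims=True)
    var = x_nhwc.var(axis=(1, 2), keepdims=True)
    return scale * (x_nhwc - mean) / np.sqrt(var + eps) + bias

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 8, 4)).astype(np.float32)
scale = rng.standard_normal(4).astype(np.float32)
bias = rng.standard_normal(4).astype(np.float32)

# Both paths agree, so the transposes can be elided.
assert np.allclose(instance_norm_nhwc_via_transpose(x, scale, bias),
                   instance_norm_nhwc_direct(x, scale, bias), atol=1e-5)
```

The optimization in this PR goes one step further: because the `NCHW` kernel is the faster one, the graph runs it directly instead of sandwiching it between transposes.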
#### `InstanceNormalization` OP benchmark
The input tensor shape: `(1,32,1048576)`
The scale tensor shape: `(32)`
The B tensor shape: `(32)`
| device | baseline (ms) | opt (ms) | diff |
| ---------------- | -------- | ---- | ---- |
| Lunar Lake | 82.6 | 34.2 | -58% |
#### Model benchmark
| model | baseline (ms) | opt (ms) | diff |
| ---------------- | -------- | ---- | ---- |
| sd-turbo-vae-decoder-fp16-demo | 2437.6 | 1835.9 | -25% |
### Motivation and Context
Please see the description above.