[js/webgpu] Optimize InstanceNorm in some shapes (#22637)
BUG #22031
Optimize below two situations:
1. Increase workgroupSize if only one workgroup is dispatched.
2. Avoid transpose if not necessary.
The overall time of demucs model becomes 106.36 ms from 154.60 ms on my
dGPUs with this PR and PR #22577