[webgpu] Optimize Conv by im2col-matmul (#26603)
### Description
This PR optimizes the `Conv` operation by implementing two new compute
shaders: `oihw_to_ohwi` and `im2col-matmul`.
`oihw_to_ohwi`:
Improves performance over the default Transpose shader by staging data in
workgroup memory so that global memory reads and writes are both contiguous.
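For reference, the layout change itself can be sketched on the CPU as below. This is only an illustration of the index mapping, assuming flat row-major buffers; the function name and signature are hypothetical, and it does not model the workgroup-memory staging used by the actual shader.

```python
def oihw_to_ohwi(weights, O, I, H, W):
    """Reorder a flat OIHW weight buffer into a flat OHWI buffer.

    Hypothetical CPU reference, not the WGSL shader from this PR.
    """
    assert len(weights) == O * I * H * W
    out = [0] * len(weights)
    for o in range(O):
        for i in range(I):
            for h in range(H):
                for w in range(W):
                    # Source index in OIHW order (W varies fastest).
                    src = ((o * I + i) * H + h) * W + w
                    # Destination index in OHWI order (I varies fastest).
                    dst = ((o * H + h) * W + w) * I + i
                    out[dst] = weights[src]
    return out


if __name__ == "__main__":
    # 2 output channels, 3 input channels, 1x2 kernel.
    print(oihw_to_ohwi(list(range(12)), 2, 3, 1, 2))
```

After the reorder, the input-channel values for a given kernel position sit next to each other in memory, which is the layout a vectorized load along the channel dimension would want.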
`im2col-matmul`:
- Employs a workgroup size of 64.
- Dynamically selects tile sizes (32x64 or 16x64) based on the
source/weight shape.
- Each invocation handles a dedicated weight element.
- Uses `subgroupShuffle` to efficiently access the source tile, leveraging
`k_vec4` vectorization for better memory throughput.
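The underlying im2col-matmul idea can be sketched with a small CPU reference. The sketch below is not the shader: it assumes batch size 1, stride 1, no padding, an NHWC source and OHWI weights stored as flat lists, and all function names are hypothetical. It omits tiling, subgroup shuffles and vectorization, and only shows that unfolding patches into rows and multiplying by the flattened weights matches a direct convolution.

```python
def conv2d_direct(src, wgt, H, W, C, O, KH, KW):
    """Naive convolution: NHWC source (batch 1), OHWI weights, stride 1, no padding."""
    OH, OW = H - KH + 1, W - KW + 1
    out = [0] * (OH * OW * O)
    for oy in range(OH):
        for ox in range(OW):
            for o in range(O):
                acc = 0
                for ky in range(KH):
                    for kx in range(KW):
                        for c in range(C):
                            s = src[((oy + ky) * W + (ox + kx)) * C + c]
                            k = wgt[((o * KH + ky) * KW + kx) * C + c]
                            acc += s * k
                out[(oy * OW + ox) * O + o] = acc
    return out


def im2col(src, H, W, C, KH, KW):
    """Unfold every KHxKW patch into one row of length KH*KW*C."""
    OH, OW = H - KH + 1, W - KW + 1
    rows = []
    for oy in range(OH):
        for ox in range(OW):
            row = []
            for ky in range(KH):
                for kx in range(KW):
                    base = ((oy + ky) * W + (ox + kx)) * C
                    row.extend(src[base:base + C])
            rows.append(row)
    return rows


def conv2d_im2col_matmul(src, wgt, H, W, C, O, KH, KW):
    """Convolution as (OH*OW x K) times (K x O), with K = KH*KW*C."""
    K = KH * KW * C
    out = []
    for row in im2col(src, H, W, C, KH, KW):
        for o in range(O):
            # OHWI weights for output channel o are already a contiguous K-vector.
            out.append(sum(row[k] * wgt[o * K + k] for k in range(K)))
    return out


if __name__ == "__main__":
    H, W, C, O, KH, KW = 4, 5, 3, 2, 2, 3
    src = [(i * 7) % 11 - 5 for i in range(H * W * C)]
    wgt = [(i * 5) % 7 - 3 for i in range(O * KH * KW * C)]
    a = conv2d_direct(src, wgt, H, W, C, O, KH, KW)
    b = conv2d_im2col_matmul(src, wgt, H, W, C, O, KH, KW)
    print(a == b)
```

With OHWI weights, each output channel's weights form one contiguous vector of length `KH*KW*C`, so the matmul's reduction dimension lines up with the unfolded patch rows without further reshuffling.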
Testing on Lunar Lake demonstrated **up to an 87%** performance
improvement in Conv_2D operations.
### Motivation and Context
See above.