[WebGPU] Optimize GEMM with vec4 (#24478)
### Description
This PR uses vec4 to optimize GEMM when the number of columns of both A
and B is divisible by 4; otherwise it falls back to the previous shader.
I will add a u32/vec2 implementation in the future, and at that point we
will keep only one shader.
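
As a rough illustration of the selection rule above, here is a minimal Python sketch. It is not the code from this PR: the function names and the `"vec4"`/`"scalar"` labels are hypothetical, and it assumes "columns" means the column count of each matrix as stored, before any transpose is applied.

```python
def can_use_vec4(m: int, n: int, k: int, trans_a: bool, trans_b: bool) -> bool:
    """Return True when the stored column counts of A and B are multiples of 4.

    Assumption: A is stored as K x M when trans_a is set, otherwise M x K;
    B is stored as N x K when trans_b is set, otherwise K x N.
    """
    a_cols = m if trans_a else k
    b_cols = k if trans_b else n
    return a_cols % 4 == 0 and b_cols % 4 == 0


def pick_shader(m: int, n: int, k: int, trans_a: bool, trans_b: bool) -> str:
    """Pick the vec4 path when possible, else the previous (scalar) path.

    The returned labels are hypothetical names used only for this sketch.
    """
    return "vec4" if can_use_vec4(m, n, k, trans_a, trans_b) else "scalar"


if __name__ == "__main__":
    # The benchmark shape from this PR qualifies for the vec4 path.
    print(pick_shader(1024, 1024, 1024, False, False))
    # With trans_a set, A has M = 6 stored columns, so the fallback is used.
    print(pick_shader(6, 8, 4, True, False))
```

Under this reading, the same M, N, K can take different paths depending on the transpose flags, because the flags change which dimension is the stored column count.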
### Perf comparison
I ran a custom model that contains only a GEMM (M = N = K = 1024) with
Node.js on M2/M3 Max. Performance improves by roughly 20%.
| | !transA&&!transB | transA | transB | transA&&transB |
|--------|------------------|------------|-------------|----------------|
| M2 | 9.36->7.41 | 9.45->7.54 | 11.21->8.19 | 9.66->8.37 |
| M3 Max | 8.07->6.99 | 7.54->6.53 | 8.42->5.89 | 5.47->5.29 |
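
To see how the individual cells relate to the overall figure, the relative change per configuration can be computed as sketched below. This assumes each cell reads as before->after and that lower is better; the unit is not stated in this PR, and the helper name is hypothetical.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100.0


# Values copied from the table; column order is
# !transA&&!transB, transA, transB, transA&&transB.
results = {
    "M2": [(9.36, 7.41), (9.45, 7.54), (11.21, 8.19), (9.66, 8.37)],
    "M3 Max": [(8.07, 6.99), (7.54, 6.53), (8.42, 5.89), (5.47, 5.29)],
}

if __name__ == "__main__":
    for device, pairs in results.items():
        reductions = [round(relative_reduction(b, a), 1) for b, a in pairs]
        print(device, reductions)
```

The per-cell reductions vary by configuration and device, so the summary figure is best read as an approximate average.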