[js/webgpu] Optimize matmul (#16969)
### Description
Changes in this PR:
1) Use the optimized `makeMatMulPacked[Vec4]Source` implementation for
matmul.
2) Enable the conv2dByMatMul path.
3) Support broadcasting.
4) Use IndicesHelper.
With profilingMode enabled, MatMul with M = 512, K = 512, N = 512 drops
from 15 ms to 2 ms on my ADL.
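
For reviewers unfamiliar with the broadcast case in item 3, here is a minimal sketch of how the output shape of a broadcast matmul can be derived. It assumes NumPy-style broadcasting over the batch dimensions and inputs of rank 2 or higher. The helper names `broadcastBatchDims` and `matMulOutputShape` are hypothetical and are not taken from this PR.

```typescript
// Hypothetical helper (not from this PR): broadcast two batch shapes,
// aligning them from the trailing dimension as NumPy does.
function broadcastBatchDims(a: readonly number[], b: readonly number[]): number[] {
  const rank = Math.max(a.length, b.length);
  const out: number[] = new Array<number>(rank);
  for (let i = 0; i < rank; i++) {
    const ia = a.length - 1 - i;
    const ib = b.length - 1 - i;
    // A missing leading dimension behaves like a dimension of size 1.
    const da = ia >= 0 ? a[ia] : 1;
    const db = ib >= 0 ? b[ib] : 1;
    if (da !== db && da !== 1 && db !== 1) {
      throw new Error(`cannot broadcast batch dims ${da} and ${db}`);
    }
    out[rank - 1 - i] = da === 1 ? db : da;
  }
  return out;
}

// Hypothetical helper (not from this PR): output shape of A x B where
// A is [..., M, K] and B is [..., K, N]. 1-D inputs are out of scope here.
function matMulOutputShape(aShape: readonly number[], bShape: readonly number[]): number[] {
  if (aShape.length < 2 || bShape.length < 2) {
    throw new Error('this sketch only handles inputs of rank >= 2');
  }
  const m = aShape[aShape.length - 2];
  const k = aShape[aShape.length - 1];
  const k2 = bShape[bShape.length - 2];
  const n = bShape[bShape.length - 1];
  if (k !== k2) {
    throw new Error(`shared dimension mismatch: ${k} vs ${k2}`);
  }
  const batch = broadcastBatchDims(aShape.slice(0, -2), bShape.slice(0, -2));
  return [...batch, m, n];
}
```

For example, `matMulOutputShape([2, 1, 512, 512], [3, 512, 512])` broadcasts the batch shapes `[2, 1]` and `[3]` to `[2, 3]` before appending M and N.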