[WebGPU] Implement Split-K on Conv|MatMul (#26461)
### Description
This patch implements the `Split-K` optimization on `Conv|MatMul`. With
`Split-K` we can re-arrange the computation into multiple workgroups
when `K` is large, which increases parallelism on the platforms where
`Split-K` is confirmed to be useful.
1. Support `Split-K` in `MakeMatMulPackedVec4Source()` to split a
workgroup with large `K` into smaller ones. In this patch we only support
`Split-K` with `batch_size == 1` and `vec4` on `Conv|MatMul`.
2. Support `Split-K` in `MatMulWriteFnSource()` (add the partial result
to the output with atomic built-in functions).
3. Implement `SplitKConfig` to decide whether `Split-K` should be used
or not, and all the related thresholds.
4. Implement `MatMulFillBiasBeforeSplitKProgram` to initialize the
output with `bias` or 0 when `Split-K` is used.
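
The steps above can be sketched on the CPU as follows. This is a minimal illustration only, not the WGSL generated by this patch: the names `fill_bias`, `split_k_matmul` and `split_k` are hypothetical, the bias is assumed to be broadcast over output rows, and the plain `+=` stands in for the atomic add that is needed on the GPU, where the `K` chunks run concurrently.

```python
# Illustrative CPU model of Split-K for out = A @ B (+ bias), batch_size == 1.
# All names here are hypothetical and are not ONNX Runtime identifiers.

def fill_bias(m, n, bias=None):
    """Initialize the M x N output with bias (assumed one value per column) or 0."""
    return [[(bias[j] if bias is not None else 0.0) for j in range(n)]
            for _ in range(m)]

def split_k_matmul(a, b, bias=None, split_k=4):
    """Accumulate per-chunk partial products into a pre-initialized output."""
    m, k, n = len(a), len(b), len(b[0])
    out = fill_bias(m, n, bias)
    # Each chunk of the K dimension models one workgroup along the split axis.
    for k_start in range(0, k, split_k):
        k_end = min(k_start + split_k, k)
        for i in range(m):
            for j in range(n):
                partial = sum(a[i][kk] * b[kk][j] for kk in range(k_start, k_end))
                # On the GPU this accumulation would have to be atomic.
                out[i][j] += partial
    return out

def reference_matmul(a, b, bias=None):
    """Plain matmul with one loop over the whole K dimension, for comparison."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][kk] * b[kk][j] for kk in range(k))
             + (bias[j] if bias is not None else 0.0)
             for j in range(n)] for i in range(m)]
```

Because the output is initialized before any partial result is added, the chunks can be accumulated in any order, which is why a separate fill step is needed when `Split-K` is used.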
### Motivation and Context
In the current implementation, when `K` (or `dim_inner`) is large, each
invocation walks the whole `K` dimension sequentially in one very long loop,
which may not make full use of all EUs on a GPU.
With `Split-K` we can split such a large amount of computation (`K`) across
multiple workgroups, each with less computation (`kSplitK`, smaller than `K`),
which can greatly improve parallelism.
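
To make the effect on parallelism concrete, here is a rough sketch of how the workgroup count changes. The tile sizes, the shape, the split size of 256 and the use of the third dispatch dimension for the `K` split are all assumptions chosen for illustration; the actual values are decided by `SplitKConfig` and the dispatch logic in the patch.

```python
import math

def dispatch_size(m, n, k, tile_m=32, tile_n=32, split_k=None):
    """Return a hypothetical (x, y, z) workgroup grid for an M x K by K x N matmul.

    Without Split-K a single workgroup per output tile loops over all of K.
    With Split-K, ceil(K / split_k) workgroups share each output tile.
    """
    x = math.ceil(n / tile_n)
    y = math.ceil(m / tile_m)
    z = 1 if split_k is None else math.ceil(k / split_k)
    return x, y, z

def total_workgroups(grid):
    """Number of workgroups in a dispatch grid."""
    x, y, z = grid
    return x * y * z

# Hypothetical shape with a small output and a large K.
baseline = dispatch_size(49, 320, 1280)
with_split = dispatch_size(49, 320, 1280, split_k=256)
print(total_workgroups(baseline), total_workgroups(with_split))
```

For this made-up shape the baseline launches 20 workgroups that each loop over all 1280 elements of `K`, while the split version launches 100 workgroups that each loop over 256.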
With this patch we get about a 15% performance improvement on
`efficientnet-lite-f16-demo` and a 9% improvement on
`mobilenetv2-12-f16-demo` on Lunar Lake and Meteor Lake.