[WebGPU] Implement Split-K on GEMM (#26751)
### Description
This patch implements the `Split-K` optimization for `GEMM`.
1. Support handling `GEMM` in `MatMulFillBiasOrZeroBeforeSplitKProgram`.
This requires adding `beta` as a new uniform value, along with all the
parameters that `MatMulWriteFnSource()` uses to handle every `GEMM` case
(including the broadcast of `beta` on both dimensions).
2. Support `Split-K` in `GemmProgram::GenerateShaderCode()`.
3. Add cases to `GemmOptimizePackedTest` to test `Split-K` in `GEMM`.
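
For readers unfamiliar with the technique, the two-phase flow described above can be sketched on the CPU as follows. This is only an illustrative sketch, not the WGSL shader code in this patch: `BiasShape`, `FillBiasOrZero`, `AccumulateSplitK`, and `GemmSplitK` are hypothetical names, and it assumes the standard `GEMM` definition `Y = alpha * A * B + beta * C` with row-major storage and a `C` that may be broadcast to `[M, N]`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical broadcast kinds for the C operand (not taken from the patch).
enum class BiasShape { None, Scalar, RowVector /* [1, N] */, ColVector /* [M, 1] */, Full /* [M, N] */ };

// Phase 1: pre-fill the output with beta * C (broadcast to [M, N]) or with zero.
std::vector<float> FillBiasOrZero(std::size_t M, std::size_t N, float beta,
                                  const std::vector<float>& c, BiasShape shape) {
  std::vector<float> y(M * N, 0.0f);
  if (shape == BiasShape::None) return y;
  for (std::size_t m = 0; m < M; ++m) {
    for (std::size_t n = 0; n < N; ++n) {
      std::size_t idx = 0;  // Scalar: always element 0.
      if (shape == BiasShape::RowVector) idx = n;
      if (shape == BiasShape::ColVector) idx = m;
      if (shape == BiasShape::Full) idx = m * N + n;
      y[m * N + n] = beta * c[idx];
    }
  }
  return y;
}

// Phase 2: split the K dimension into slices; each slice adds its partial
// product into the pre-filled output. On a GPU the slices would run
// concurrently, so this accumulation would need to be atomic or be replaced
// by a separate reduction pass.
void AccumulateSplitK(std::size_t M, std::size_t N, std::size_t K, std::size_t split_k,
                      float alpha, const std::vector<float>& a,
                      const std::vector<float>& b, std::vector<float>& y) {
  assert(split_k > 0);
  const std::size_t chunk = (K + split_k - 1) / split_k;  // ceil(K / split_k)
  for (std::size_t s = 0; s < split_k; ++s) {
    const std::size_t k_begin = std::min(s * chunk, K);
    const std::size_t k_end = std::min(k_begin + chunk, K);
    for (std::size_t m = 0; m < M; ++m) {
      for (std::size_t n = 0; n < N; ++n) {
        float partial = 0.0f;
        for (std::size_t k = k_begin; k < k_end; ++k) {
          partial += a[m * K + k] * b[k * N + n];
        }
        y[m * N + n] += alpha * partial;
      }
    }
  }
}

// Convenience wrapper that runs both phases in order.
std::vector<float> GemmSplitK(std::size_t M, std::size_t N, std::size_t K, std::size_t split_k,
                              float alpha, float beta, const std::vector<float>& a,
                              const std::vector<float>& b, const std::vector<float>& c,
                              BiasShape shape) {
  std::vector<float> y = FillBiasOrZero(M, N, beta, c, shape);
  AccumulateSplitK(M, N, K, split_k, alpha, a, b, y);
  return y;
}
```

The bias has to be written before any slice accumulates: if each slice added `beta * C` itself, the bias would be counted once per slice instead of once per output element.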
### Motivation and Context
With this PR, we can achieve about a 20% improvement in
`florence-2-base-decoder-with-past-fp16` and a 10% improvement in
`detr-resnet-50-fp16` on a Lunar Lake iGPU.