[webgpu] Support any batch size for dp4a matmul path (#26884)
This pull request adds support for batched matrix multiplication to the
DP4A quantized matmul WebGPU kernels, along with the associated C++ code
and tests. The changes update the kernel code, tensor shapes, dispatch
logic, and test infrastructure to handle a `batch_count` greater than 1,
so batched inputs can run through the DP4A path.
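
As a rough illustration of what handling `batch_count` in the dispatch
logic can involve, the sketch below shows one common scheme: give each
batch its own grid of output tiles, launch `tiles_m * tiles_n *
batch_count` workgroups, and decode a flat workgroup index back into a
batch and tile position. Every name here (`NumWorkgroups`,
`DecodeWorkgroup`, `BatchOffsetA`, the `tile` parameter) is hypothetical
and the row-major layout is an assumption; this is not taken from the
actual kernels in this change.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration only; names and layout are assumptions,
// not the ONNX Runtime implementation.
struct TileCoord {
  uint32_t batch;     // which matrix in the batch
  uint32_t tile_row;  // tile index along M
  uint32_t tile_col;  // tile index along N
};

// Number of tiles needed to cover `extent` elements with `tile`-sized tiles.
uint32_t CeilDiv(uint32_t extent, uint32_t tile) {
  assert(tile > 0);
  return (extent + tile - 1) / tile;
}

// Total workgroups when each batch gets its own grid of output tiles.
uint32_t NumWorkgroups(uint32_t M, uint32_t N, uint32_t tile,
                       uint32_t batch_count) {
  return CeilDiv(M, tile) * CeilDiv(N, tile) * batch_count;
}

// Decode a flat workgroup index into (batch, tile_row, tile_col).
TileCoord DecodeWorkgroup(uint32_t workgroup_idx, uint32_t M, uint32_t N,
                          uint32_t tile) {
  const uint32_t tiles_n = CeilDiv(N, tile);
  const uint32_t tiles_per_batch = CeilDiv(M, tile) * tiles_n;
  const uint32_t in_batch = workgroup_idx % tiles_per_batch;
  TileCoord c;
  c.batch = workgroup_idx / tiles_per_batch;
  c.tile_row = in_batch / tiles_n;
  c.tile_col = in_batch % tiles_n;
  return c;
}

// Element offset of a batch's A matrix, assuming a dense [batch, M, K] layout.
uint32_t BatchOffsetA(uint32_t batch, uint32_t M, uint32_t K) {
  return batch * M * K;
}
```

With `batch_count` equal to 1, the decoded `batch` is always 0 and the
offsets vanish, so the single-batch case falls out of the same
arithmetic.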