[webgpu] Make DP4AMatMulNBitsSmallMProgram a shader template (#25025)
### Description
This commit refactors `DP4AMatMulNBitsSmallMProgram` so that both
`tile_size_k_vec` and `tile_size` are configurable instead of fixed. This
gives more flexibility for performance tuning without altering the core
shader functionality.
There is no functional change in this commit.
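As a rough illustration of the idea only (this is not the code in this commit), a shader generator could take the two tile parameters as arguments and emit them as WGSL constants. The helper name `MakeTileConstants` and the emitted constant layout are hypothetical.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper, not part of onnxruntime: emits WGSL constant
// declarations for the two tile parameters so the rest of the shader body
// can refer to them by name instead of relying on hard-coded literals.
std::string MakeTileConstants(uint32_t tile_size_k_vec, uint32_t tile_size) {
  std::ostringstream ss;
  ss << "const tile_size_k_vec = " << tile_size_k_vec << "u;\n"
     << "const tile_size = " << tile_size << "u;\n";
  return ss.str();
}
```

With a helper along these lines, trying a different tiling only means passing different arguments, while the shader body itself stays unchanged.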
### Motivation and Context
This is a preparatory change that enables performance optimization work on
`DP4AMatMulNBitsSmallMProgram` in subsequent commits.
---------
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>