[webgpu] Optimize DP4AMatMulNBitsSmallMProgram for Intel (#25192)
### Description
This PR optimizes the Intel GPU path for the
`DP4AMatMulNBitsSmallMProgram` by tuning `tile_size` and
`tile_size_k_vec`.
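
As a rough illustration of vendor-specific tile tuning, here is a minimal sketch, not the code from this PR. `TileConfig`, `SelectTileConfig`, and `NumTiles` are hypothetical names. The numeric values are placeholders, not the values chosen in the PR. The field comments describe an assumed meaning of each parameter.

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical container for the two knobs named in the description.
struct TileConfig {
  uint32_t tile_size;        // assumed: output columns (N) covered by one workgroup
  uint32_t tile_size_k_vec;  // assumed: vectorized steps along K covered by one tile
};

// Pick tile parameters by adapter vendor string.
// All numbers below are placeholders for illustration only.
constexpr TileConfig SelectTileConfig(std::string_view vendor) {
  if (vendor == "intel") {
    return TileConfig{4, 32};  // placeholder Intel-specific tuning
  }
  return TileConfig{32, 16};  // placeholder default for other vendors
}

// Number of tiles needed to cover n columns (ceiling division).
constexpr uint32_t NumTiles(uint32_t n, uint32_t tile_size) {
  return (n + tile_size - 1) / tile_size;
}
```

A smaller `tile_size` produces more tiles for the same N, so the right balance depends on the hardware, which is why such values are usually chosen per vendor by measurement.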
### Motivation and Context
With this change, we achieved a >8% performance boost on Intel iGPUs
(Xe-LP and Xe2-LPG) for the phi-4-mini-accuracy4 model.