[webgpu] Optimize MatMulNBits for f16 Block32 prefill performance (#23908)
### Description
This commit improves MatMulNBits f16 Block32 prefill performance by
increasing the tile size and improving memory efficiency. It achieves a
+2x performance boost on Intel iGPUs for the Phi-3.5-mini f16 model.
### Motivation and Context
See above.