webgpu: Increase MatMulNBits K-parallelism with tile_size_k_vec=32 (#27834)
Use tile_size_k_vec=32 (instead of 16) for the default MatMulNBits
kernel, doubling the number of threads that work on the K-dimension
reduction for each output row. This improves token generation
throughput by ~3% on NVIDIA GPUs by making better use of memory
bandwidth.
Intel devices retain tile_size_k_vec=16 due to different subgroup and
cache characteristics.
Changes:
- matmul_nbits.h: Add a tile_size_k_vec parameter (default 16) to the
MatMulNBitsProgram constructor.
- matmul_nbits.cc: Select tile_size_k_vec=32 for non-Intel vendors and
pass it to the program constructor.
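
The two changes above can be sketched roughly as follows. This is an illustrative sketch, not the code from matmul_nbits.h or matmul_nbits.cc: ProgramSketch, SelectTileSizeKVec, MakeProgram, and the lowercase vendor string are hypothetical names chosen for the example, and how the real code identifies the vendor is not shown here.

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical stand-in for MatMulNBitsProgram; only the new constructor
// parameter and its default of 16 are modeled.
class ProgramSketch {
 public:
  explicit ProgramSketch(std::uint32_t tile_size_k_vec = 16)
      : tile_size_k_vec_(tile_size_k_vec) {}

  std::uint32_t tile_size_k_vec() const { return tile_size_k_vec_; }

 private:
  std::uint32_t tile_size_k_vec_;
};

// Hypothetical vendor check: Intel keeps 16, every other vendor gets 32.
// The vendor string format is an assumption made for this sketch.
constexpr std::uint32_t SelectTileSizeKVec(std::string_view vendor) {
  return vendor == "intel" ? 16u : 32u;
}

// Select the tile size for the vendor and pass it to the constructor.
ProgramSketch MakeProgram(std::string_view vendor) {
  return ProgramSketch{SelectTileSizeKVec(vendor)};
}
```

Keeping 16 as the constructor default means any caller that does not pass the argument keeps the previous behavior.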