onnxruntime
358628a8 - webgpu: Increase MatMulNBits K-parallelism with tile_size_k_vec=32 (#27834)

Commit

46 days ago

webgpu: Increase MatMulNBits K-parallelism with tile_size_k_vec=32 (#27834)

Use tile_size_k_vec=32 (instead of 16) for MatMulNBits default kernel, doubling the number of threads working on K-dimension reduction per output row. This improves token generation throughput by ~3% on NVIDIA GPUs by better utilizing memory bandwidth. Intel devices retain tile_size_k_vec=16 due to different subgroup and cache characteristics.

Changes:
- matmul_nbits.h: Add tile_size_k_vec parameter (default 16) to MatMulNBitsProgram constructor.
- matmul_nbits.cc: Select tile_size_k_vec=32 for non-Intel vendors, pass to program constructor.
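The selection logic described above can be sketched roughly as follows. This is an illustrative sketch, not the code from the commit: `MatMulNBitsProgramSketch`, `SelectTileSizeKVec`, `MakeProgramForVendor`, and the lowercase vendor string `"intel"` are all hypothetical names and assumptions. Only the values 16 and 32, the default of 16 on the constructor parameter, and the Intel versus non-Intel split come from the commit message.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the real program class. Only the constructor
// parameter and its default of 16 are taken from the commit message.
struct MatMulNBitsProgramSketch {
  explicit MatMulNBitsProgramSketch(uint32_t k_vec = 16)
      : tile_size_k_vec(k_vec) {
    // A zero tile size would make no sense for a K-dimension reduction.
    assert(k_vec > 0);
  }

  uint32_t tile_size_k_vec;
};

// Hypothetical helper: Intel keeps 16, every other vendor gets 32.
// The vendor string format ("intel") is an assumption for illustration.
inline uint32_t SelectTileSizeKVec(const std::string& vendor) {
  return vendor == "intel" ? 16u : 32u;
}

// Hypothetical factory: choose the tile size for the vendor and pass it
// to the program constructor, mirroring the two changes listed above.
inline MatMulNBitsProgramSketch MakeProgramForVendor(const std::string& vendor) {
  return MatMulNBitsProgramSketch{SelectTileSizeKVec(vendor)};
}
```

Keeping 16 as the constructor default means any caller that does not pass a value explicitly keeps the previous behavior, and only the call site that checks the vendor opts into 32.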

References

#27834 - webgpu: Increase MatMulNBits K-parallelism with tile_size_k_vec=32

Author

qjia7

Parents

f869122a
