onnxruntime
358628a8 - webgpu: Increase MatMulNBits K-parallelism with tile_size_k_vec=32 (#27834)

Committed 5 days ago
Use tile_size_k_vec=32 (instead of 16) for the MatMulNBits default kernel. This doubles the number of threads working on the K-dimension reduction for each output row. On NVIDIA GPUs it improves token-generation throughput by ~3% through better use of memory bandwidth. Intel devices keep tile_size_k_vec=16 because their subgroup and cache characteristics differ.

Changes:
- matmul_nbits.h: Add a tile_size_k_vec parameter (default 16) to the MatMulNBitsProgram constructor.
- matmul_nbits.cc: Select tile_size_k_vec=32 for non-Intel vendors and pass it to the program constructor.