Add MatMul 4bits support on GPU (#17890)
### Description
Add a contrib op MatMulNBits and the related toolchain to support
weight-only quantization. This PR only adds support for 4 bits. It:
- adds a schema for the contrib op MatMulNBits, which can support 1-7 bit
weight quantization.
- adds a naive implementation of 4-bit MatMulNBits on CPU and GPU, i.e.,
computed as MatMul(A, Dequantize(B)); see the sketch after this list.
- adds a specialized GemV implementation for 4-bit MatMulNBits and a
related benchmark tool.
- adds a tool to quantize a model's weights to 4 bits (a usage sketch
appears at the end of this description).
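
For reference, here is a minimal NumPy sketch of the naive MatMul(A,
Dequantize(B)) path. The block size of 32, the per-block scale and zero
point, and the packing of two 4-bit values per byte are illustrative
assumptions, not necessarily the exact layout used by the kernels in this
PR:

```python
import numpy as np

def dequantize_4bit(packed: np.ndarray, scales: np.ndarray,
                    zero_points: np.ndarray, K: int, N: int,
                    block_size: int = 32) -> np.ndarray:
    """Unpack a (N, K//2) uint8 matrix of 4-bit values into float32 (K, N).

    Assumes K is a multiple of block_size, and that each byte stores the
    value at the even K index in its low nibble and the odd index in its
    high nibble (an illustrative packing, not necessarily the kernel's).
    """
    low = packed & 0x0F          # (N, K//2)
    high = (packed >> 4) & 0x0F  # (N, K//2)
    q = np.empty((N, K), dtype=np.uint8)
    q[:, 0::2] = low
    q[:, 1::2] = high
    # Apply a per-block scale and zero point along the K dimension.
    blocks = K // block_size
    q = q.reshape(N, blocks, block_size).astype(np.float32)
    deq = (q - zero_points[:, :, None]) * scales[:, :, None]
    return deq.reshape(N, K).T   # (K, N)

def matmul_nbits_reference(A, packed_B, scales, zero_points, K, N):
    """Naive reference: dequantize B to float, then do a plain matmul."""
    return A @ dequantize_4bit(packed_B, scales, zero_points, K, N)

# Example: M=2, K=64, N=4 with random quantized weights.
M, K, N, bs = 2, 64, 4, 32
rng = np.random.default_rng(0)
A = rng.standard_normal((M, K)).astype(np.float32)
packed_B = rng.integers(0, 256, size=(N, K // 2), dtype=np.uint8)
scales = rng.standard_normal((N, K // bs)).astype(np.float32)
zps = np.full((N, K // bs), 8, dtype=np.float32)
print(matmul_nbits_reference(A, packed_B, scales, zps, K, N).shape)  # (2, 4)
```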
Next:
- add more general and more efficient kernels for 4-bit MatMulNBits on CPU
and GPU.
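
As a usage sketch for the quantization tool: the snippet below assumes the
tool exposes a `MatMul4BitsQuantizer` class under
`onnxruntime.quantization.matmul_4bits_quantizer`, as in current ONNX
Runtime releases; the exact module path, class name, and parameters at the
time of this PR may differ.

```python
# Hypothetical usage of the 4-bit weight quantization tool. The module path,
# class name, and parameters are assumptions based on current ONNX Runtime
# releases and may not match the initial version added in this PR.
import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

model = onnx.load("model.onnx")
quantizer = MatMul4BitsQuantizer(
    model,
    block_size=32,       # number of weights sharing one scale/zero point
    is_symmetric=True,   # symmetric quantization around zero
)
quantizer.process()      # rewrites eligible MatMul nodes to MatMulNBits
quantizer.model.save_model_to_file("model_int4.onnx")
```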