[MLAS] add q4 quantize and transpose kernel to support MatMulNBits QDQ fuse (#21054)
### Description
1. Added a kernel that quantizes the MatMul B tensor to q4 and stores the result in the same shape as the original tensor. Scales and zero points are computed per quant block as well, and they share the same shape as each other.
2. Added a kernel that transposes the q4 B tensor into the B tensor layout used by MatMulNBits. Scales and zero points are transposed as well. (A NumPy reference sketch of both steps follows this list.)
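
For illustration only, here is a minimal NumPy sketch of the two steps, assuming asymmetric 4-bit quantization with one scale and zero point per block along the K dimension. The function names, the unpacked one-value-per-byte storage, and the exact block layout are assumptions for readability, not the MLAS kernels' actual interfaces.

```python
import numpy as np

def quantize_q4_blockwise(B, block_size=64):
    """Quantize B (K x N, float32) to unsigned 4-bit values, with one scale and
    zero point per block of `block_size` rows. The quantized values are kept in
    the same (K x N) shape as the original tensor."""
    K, N = B.shape
    n_blocks = (K + block_size - 1) // block_size
    q = np.zeros((K, N), dtype=np.uint8)
    scales = np.zeros((n_blocks, N), dtype=np.float32)
    zero_points = np.zeros((n_blocks, N), dtype=np.uint8)
    for b in range(n_blocks):
        lo, hi = b * block_size, min((b + 1) * block_size, K)
        blk = B[lo:hi, :]
        bmin = np.minimum(blk.min(axis=0), 0.0)   # include 0 so it stays representable
        bmax = np.maximum(blk.max(axis=0), 0.0)
        scale = (bmax - bmin) / 15.0              # 4-bit range is 0..15
        scale[scale == 0] = 1.0                   # avoid division by zero for all-zero blocks
        zp = np.clip(np.round(-bmin / scale), 0, 15).astype(np.uint8)
        q[lo:hi, :] = np.clip(np.round(blk / scale) + zp, 0, 15).astype(np.uint8)
        scales[b, :] = scale
        zero_points[b, :] = zp
    return q, scales, zero_points

def transpose_for_matmulnbits(q, scales, zero_points):
    """Rearrange the K x N quantized tensor into an N-major layout, which is the
    kind of reordering the transpose kernel performs before packing the B tensor,
    scales, and zero points for MatMulNBits."""
    return q.T.copy(), scales.T.copy(), zero_points.T.copy()
```

The real kernels additionally pack two 4-bit values per byte and split the work across threads; the sketch keeps one value per uint8 element so the per-block math stays visible.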
#### Benchmark

| Input | Quant block | Threads | Quantize | Transpose |
| --- | --- | --- | --- | --- |
| 1024 x 4096 | 64 | 8 | 23035923 ns (~23.0 ms) | 718635 ns (~0.72 ms) |
| 1024 x 4095 | 64 | 8 | 26759319 ns (~26.8 ms) | 1279064 ns (~1.28 ms) |
### Motivation and Context
The MatMulNBits tool chain currently only supports converting a MatMul op directly to a MatMulNBits op, and MatMulNBits is not a standard ONNX op. Therefore, the tool chain needs to support converting MatMul to Q/DQ format, with a later transform step converting DQ + MatMul into MatMulNBits. The tensors stored on the DQ node are the quantized constants and will end up stored in the MatMulNBits node. (A sketch of the DQ + MatMul pattern that the transform consumes is shown below.)
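
As an illustration of the Q/DQ form that the transform step would fuse, here is a minimal sketch that builds a DequantizeLinear + MatMul subgraph with `onnx.helper`. The tensor names, shapes, and the use of graph inputs instead of initializers are assumptions made to keep the sketch short, not the tool chain's exact output; blocked DequantizeLinear and the 4-bit types require opset 21 / a recent onnx package.

```python
from onnx import TensorProto, helper

# Hypothetical shapes: A is (M, K) activations, B is (K, N) 4-bit weights with
# one scale/zero point per 64-element block along K.
K, N, block_size = 4096, 4096, 64

# DQ + MatMul is the pattern the transform step fuses into MatMulNBits.
dq_node = helper.make_node(
    "DequantizeLinear",
    inputs=["B_q4", "B_scales", "B_zero_points"],
    outputs=["B_dequant"],
    axis=0,                 # blocks run along the K dimension
    block_size=block_size,  # blocked dequantization (opset 21)
)
matmul_node = helper.make_node("MatMul", inputs=["A", "B_dequant"], outputs=["Y"])

graph = helper.make_graph(
    [dq_node, matmul_node],
    "dq_matmul_pattern",
    inputs=[
        helper.make_tensor_value_info("A", TensorProto.FLOAT, ["M", K]),
        # In the tool chain's actual output these B tensors would be initializers
        # (the quantized constants mentioned above); they are declared as plain
        # graph inputs here only to keep the sketch self-contained.
        helper.make_tensor_value_info("B_q4", TensorProto.UINT4, [K, N]),
        helper.make_tensor_value_info("B_scales", TensorProto.FLOAT, [K // block_size, N]),
        helper.make_tensor_value_info("B_zero_points", TensorProto.UINT4, [K // block_size, N]),
    ],
    outputs=[helper.make_tensor_value_info("Y", TensorProto.FLOAT, ["M", N])],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 21)])
```

After fusion, the quantized B tensor, scales, and zero points attached to the DQ node become the constants of the resulting MatMulNBits node.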