onnxruntime
8e050d16 - Implement new experimental lookup-based matrix multiplication method(TMAC) (#26695)

Implement new experimental lookup-based matrix multiplication method (T-MAC) (#26695)

### Description

This PR introduces a new experimental lookup-table (LUT) based matrix multiplication method for 2-bit MatMulNBits on x64 AVX2, inspired by the [T-MAC paper](https://arxiv.org/abs/2407.00088) and the [T-MAC repository](https://github.com/microsoft/T-MAC), to speed up low-bit LLM inference. Unlike the existing quant-dequant methods, the LUT-based method directly supports mixed-precision GEMM without dequantization. It uses bit-wise table lookup to eliminate the multiplications and reduce the additions required in matrix multiplication.

<img width="1910" height="759" alt="image" src="https://github.com/user-attachments/assets/3e3f2ced-eba4-4d4e-a63c-fec479943202" />

This PR:

- Adds the `mlas.use_lut_gemm` session option, allowing use of LUT GEMM inside MatMulNBits when it is available (2-bit, BlkLen a multiple of 32, K a multiple of 32, N a multiple of 128, AVX2 present).
- Introduces LUT packing plus a kernel config cache (packs bit-planes, scales, and zero points) and the main `MlasLUTGemm` entry that generates per-row LUTs and calls the AVX2 kernel.
- Implements AVX2 LUT generation (`GenerateLUT_avx2`) and GEMM compute (`TMACComputeGemm_avx2`), and wires up dispatch in MLAS platform init.
- Updates MatMulNBits PrePack/Compute to use LUT packing/compute when opted in; keeps the existing quant-dequant path as a fallback.
- Extends the Python quant bindings with a 2-bit QDQ helper for parity with the new path.
- Adds MLAS unit tests covering LUT GEMM across symmetric/asymmetric quantization and multiple shapes/block sizes.
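The bit-wise table-lookup idea can be sketched in NumPy. This is an illustrative sketch, not the MLAS implementation: it assumes per-tensor scale and zero point and a fixed activation group size of 4, whereas the actual kernel uses per-block quantization parameters and AVX2 shuffle-based lookups.

```python
import numpy as np

def lut_matmul_2bit(A, W_q, scale, zero_point, g=4):
    """Sketch of LUT-based GEMM: C = A @ dequant(W_q).

    A:   (M, K) float activations.
    W_q: (K, N) 2-bit quantized weights stored as integers in [0, 3].
    The weights are split into two 1-bit planes; for each group of g
    activations we precompute the sums for all 2**g bit patterns once,
    then accumulate via table lookups instead of multiplications.
    """
    M, K = A.shape
    _, N = W_q.shape
    assert K % g == 0
    # Bit-plane b holds bit b of every 2-bit weight, so
    # W_q == planes[0] + 2 * planes[1].
    planes = [(W_q >> b) & 1 for b in range(2)]
    C = np.zeros((M, N))
    for m in range(M):
        for k0 in range(0, K, g):
            a = A[m, k0:k0 + g]
            # lut[p] = sum of a[i] over the set bits i of pattern p.
            patterns = np.arange(2 ** g)
            bits = (patterns[:, None] >> np.arange(g)) & 1   # (2**g, g)
            lut = bits @ a                                   # (2**g,)
            for b, plane in enumerate(planes):
                # Pack each column's g plane bits into a table index.
                idx = (plane[k0:k0 + g, :] *
                       (1 << np.arange(g))[:, None]).sum(axis=0)
                C[m] += (1 << b) * lut[idx]
    # Dequantize: w = scale * (q - zero_point), so
    # A @ W = scale * (A @ W_q - zero_point * row_sums(A)).
    return scale * (C - zero_point * A.sum(axis=1, keepdims=True))
```

Each group of g activations costs one 2**g-entry table build, after which every column of every bit-plane is reduced to a single lookup and add; this is the trade that removes multiplications from the inner loop.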
### Main components

- `MlasInitLUTGemmKernelConfig`: configuration for the LUT kernels
- `MlasLUTGemmPackQuantBData`: pre-packing of quantized weights
- `MlasLUTPackScalesAndZeroPoints`: pre-packing of quantized scales and zero points
- `MlasLUTGemm`: main entry point
- `GenerateLUT_avx2`: LUT construction from activations
- `TMACComputeGemm_avx2`: AVX2 LUT GEMM kernel
- Session option: `mlas.use_lut_gemm`

### How to test

- MLAS LUT GEMM unit tests: see `test_sqlutgemm.cpp`.
- Run MatMulNBits models with the session option `mlas.use_lut_gemm=1` on AVX2 machines; expect fallback to the existing path if the availability checks fail.

### Perf

The focus of this PR is functional and kernel bring-up; perf numbers will be reported separately once broader profiling is done.

### Future work

- Support MLFloat16 (FP16 scales and zero points).
- Add a NEON kernel for ARM.
- Add kernels for 4-bit weights and BitNet kernels.
- Broader batch (N>1) support and additional shape coverage.

---------

Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Co-authored-by: Liqun Fu <liqun.fu@microsoft.com>
Co-authored-by: carzh <wolfivyaura@gmail.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
Co-authored-by: carzh <carolinezhu@microsoft.com>
Co-authored-by: Vrajang Parikh <vrparikh@microsoft.com>