Implement new experimental lookup-based matrix multiplication method (T-MAC) (#26695)
### Description
This PR introduces a new experimental lookup-table (LUT) based matrix
multiplication method for 2-bit MatMulNBits on x64 AVX2, inspired by the
[T-MAC paper](https://arxiv.org/abs/2407.00088) and the [T-MAC
repository](https://github.com/microsoft/T-MAC), to speed up low-bit LLM
inference.
Unlike the existing quant-dequant methods, the LUT-based method directly
supports mixed-precision GEMM without dequantization. It uses bitwise
table lookups to eliminate the multiplications and reduce the additions
required in matrix multiplication.
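The bitwise lookup idea can be illustrated with a small pure-Python sketch (hypothetical helper names, not the actual MLAS kernels): an unsigned 2-bit weight `w = 2*b1 + b0` is split into two bitplanes, so `dot(w, a) = 2*dot(b1, a) + dot(b0, a)`, and each bitplane dot product is read from a 16-entry table built once per group of four activations, removing per-weight multiplications.

```python
G = 4  # activations per LUT group (2^G = 16 table entries)

def build_lut(acts):
    """Partial sums of a 4-activation group for all 16 bit patterns."""
    table = [0.0] * (1 << G)
    for pattern in range(1 << G):
        s = 0.0
        for j in range(G):
            if (pattern >> j) & 1:
                s += acts[j]
        table[pattern] = s
    return table

def lut_dot_2bit(weights, acts):
    """dot(weights, acts) for unsigned 2-bit weights via table lookup."""
    assert len(weights) == len(acts) and len(acts) % G == 0
    total = 0.0
    for g0 in range(0, len(acts), G):
        table = build_lut(acts[g0:g0 + G])
        idx0 = idx1 = 0  # 4-bit LUT indices for bitplanes 0 and 1
        for j in range(G):
            w = weights[g0 + j]
            idx0 |= (w & 1) << j         # bitplane 0
            idx1 |= ((w >> 1) & 1) << j  # bitplane 1
        total += table[idx0] + 2.0 * table[idx1]
    return total

# Cross-check against the plain dot product.
w = [3, 0, 1, 2, 2, 2, 0, 1]
a = [0.5, -1.0, 2.0, 0.25, 1.0, -0.5, 0.75, 1.5]
assert abs(lut_dot_2bit(w, a) - sum(wi * ai for wi, ai in zip(w, a))) < 1e-9
```

The real kernels amortize the table build across many output rows that share the same activations, which is where the speedup over quant-dequant comes from.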
<img width="1910" height="759" alt="image"
src="https://github.com/user-attachments/assets/3e3f2ced-eba4-4d4e-a63c-fec479943202"
/>
This PR:
- Adds the `mlas.use_lut_gemm` session option, allowing use of LUT GEMM
inside MatMulNBits when it is available (2-bit, BlkLen a multiple of 32,
K a multiple of 32, N a multiple of 128, AVX2 present).
- Introduces LUT packing + kernel config cache (packs bitplanes, scales,
ZP) and the main `MlasLUTGemm` entry that generates per-row LUTs and
calls the AVX2 kernel.
- Implements AVX2 LUT generation (`GenerateLUT_avx2`) and the GEMM
compute kernel (`TMACComputeGemm_avx2`), and wires up dispatch in MLAS
platform init.
- Updates MatMulNBits PrePack/Compute to use LUT packing/compute when
opted in; keeps the existing quant-dequant path as a fallback.
- Extends Python quant bindings with 2-bit QDQ helper for parity with
the new path.
- Adds MLAS unit tests covering LUT GEMM across symmetric/asymmetric
quant and multiple shapes/block sizes.
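For reference, the availability checks listed above can be restated as a small predicate (a hypothetical helper for illustration, not the actual MLAS gating code):

```python
def lut_gemm_available(bits: int, blk_len: int, K: int, N: int,
                       has_avx2: bool) -> bool:
    """Mirror of the documented constraints for the LUT GEMM path:
    2-bit weights, BlkLen/K multiples of 32, N multiple of 128, AVX2."""
    return (
        bits == 2
        and blk_len % 32 == 0
        and K % 32 == 0
        and N % 128 == 0
        and has_avx2
    )

# When any check fails, MatMulNBits stays on the existing quant-dequant path.
```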
### Main components:
- `MlasInitLUTGemmKernelConfig`: Configuration for LUT kernels
- `MlasLUTGemmPackQuantBData`: Pre-packing of quantized weights
- `MlasLUTPackScalesAndZeroPoints`: Pre-packing of quantized scales and
zero points
- `MlasLUTGemm`: Main entry point
- `GenerateLUT_avx2`: LUT construction from activations
- `TMACComputeGemm_avx2`: AVX2 LUT GEMM kernel
- Session option: `mlas.use_lut_gemm`
### How to test
- MLAS LUT GEMM unit tests: see `test_sqlutgemm.cpp`
- Run MatMulNBits models with the session option `mlas.use_lut_gemm=1`
on AVX2 machines; the existing path is used as a fallback if the
availability checks fail.
### Perf
The focus of this PR is functional correctness and kernel bring-up;
performance numbers will be reported separately once broader profiling
is done.
### Future Work
- Support MLFloat16 (FP16 scales and zero points).
- Add a NEON kernel for ARM.
- Add kernels for 4-bit weights and BitNet.
- Broader batch (N>1) support and additional shape coverage.
---------
Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Co-authored-by: Liqun Fu <liqun.fu@microsoft.com>
Co-authored-by: carzh <wolfivyaura@gmail.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
Co-authored-by: carzh <carolinezhu@microsoft.com>
Co-authored-by: Vrajang Parikh <vrparikh@microsoft.com>