Improve Pre-Packing for 2-bit LUT Kernels (#27131)
## Description
This PR improves the pre-packing performance for SQNBitGemm LUT (Lookup
Table) GEMM operations by replacing scalar implementations with
AVX2-optimized kernels, and adds benchmarking infrastructure to measure
performance.
### AVX2 Optimized Weight Packing
- Added `PackQuantBData_avx2()` - an AVX2-optimized weight-packing routine
that performs bit-plane decomposition and multi-reshape/transpose
operations using SIMD instructions
- Added `PackScalesAndZeroPoints_avx2()` - AVX2-optimized packing of
scales and zero points, with template specialization for the
`HasZeroPoint` cases
- Registered new dispatch functions in `MlasLutGenKernelAvx2` dispatch
structure
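The core idea behind the bit-plane decomposition can be illustrated with a scalar reference sketch. This is not the AVX2 kernel from the PR; the function name and layout below are illustrative assumptions showing what the SIMD code vectorizes: each input byte holds four 2-bit weights, which are split into a low-bit plane and a high-bit plane, eight weights per output byte.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar reference for bit-plane decomposition of 2-bit
// weights. Each input byte packs four 2-bit values (LSB-first); the output
// separates bit 0 and bit 1 of every weight into two planes, one bit per
// weight, eight weights per plane byte. The AVX2 kernel in the PR performs
// the same transformation with SIMD instructions.
void DecomposeBitPlanes2Bit(const uint8_t* packed, size_t num_bytes,
                            std::vector<uint8_t>& lo_plane,
                            std::vector<uint8_t>& hi_plane) {
    const size_t num_weights = num_bytes * 4;  // 4 weights per packed byte
    lo_plane.assign((num_weights + 7) / 8, 0);
    hi_plane.assign((num_weights + 7) / 8, 0);
    for (size_t i = 0; i < num_weights; ++i) {
        // Extract the i-th 2-bit weight.
        const uint8_t v = (packed[i / 4] >> ((i % 4) * 2)) & 0x3;
        // Scatter its two bits into the corresponding planes.
        lo_plane[i / 8] |= static_cast<uint8_t>((v & 0x1) << (i % 8));
        hi_plane[i / 8] |= static_cast<uint8_t>(((v >> 1) & 0x1) << (i % 8));
    }
}
```

With the planes separated, a LUT-based kernel can index each plane independently rather than re-extracting 2-bit fields in the inner loop.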
### Refactored Dispatch Architecture
- Moved complex scalar packing logic from `qlutgemm.cpp` to
dispatch-based architecture
- Added new function signatures: `MLAS_QNBIT_LUT_PACK_QUANTB_DATA` and
`MLAS_QNBIT_LUT_PACK_SCALES_AND_ZP`
- Extended `MLAS_QNBIT_LUT_GEMM_DISPATCH` structure with
`PackQuantBData` and `PackScalesAndZeroPoints` function pointers
- Added thread pool support to `LutPackScalesAndZeroPoints()`
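The refactoring pattern can be sketched as follows. The exact MLAS parameter lists differ; the typedefs, stub bodies, and registration shown here are assumptions for illustration of how the dispatch structure routes packing calls to a per-ISA implementation instead of a hard-coded scalar path.

```cpp
#include <cstddef>

// Illustrative function-type declarations standing in for the new
// MLAS_QNBIT_LUT_PACK_QUANTB_DATA / MLAS_QNBIT_LUT_PACK_SCALES_AND_ZP
// signatures (parameter lists are assumed, not the real ones).
typedef void MLAS_QNBIT_LUT_PACK_QUANTB_DATA(
    size_t N, size_t K, size_t BlkLen,
    const void* QuantBData, void* PackedQuantBData);

typedef void MLAS_QNBIT_LUT_PACK_SCALES_AND_ZP(
    size_t N, size_t K, size_t BlkLen, bool HasZeroPoint,
    const float* Scales, const void* ZeroPoints, void* Packed);

// Dispatch structure extended with the two new function pointers; the
// caller invokes whichever implementation the platform registered.
struct MLAS_QNBIT_LUT_GEMM_DISPATCH {
    MLAS_QNBIT_LUT_PACK_QUANTB_DATA* PackQuantBData;
    MLAS_QNBIT_LUT_PACK_SCALES_AND_ZP* PackScalesAndZeroPoints;
};

// Stubs standing in for the AVX2 kernels (bodies elided).
void PackQuantBData_avx2(size_t, size_t, size_t, const void*, void*) {}
void PackScalesAndZeroPoints_avx2(size_t, size_t, size_t, bool,
                                  const float*, const void*, void*) {}

// Registration mirrors how MlasLutGenKernelAvx2 wires in its kernels.
const MLAS_QNBIT_LUT_GEMM_DISPATCH MlasLutGenKernelAvx2 = {
    PackQuantBData_avx2,
    PackScalesAndZeroPoints_avx2,
};
```

The benefit of this shape is that `qlutgemm.cpp` only needs a null check and an indirect call; adding a NEON or AVX-512 packing path later is a matter of registering another dispatch table.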
### Benchmarking
- Added `LUTGEMM_PACK` benchmark for measuring weight packing
performance
- Added `LUTGEMM_COMPUTE` benchmark for measuring GEMM compute
performance
- Configurable parameters: `BlkLen`, `M`, `N`, `K`, `Threads`,
`HasZeroPoint`
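As a shape illustration of what `LUTGEMM_PACK` measures, here is a minimal self-contained timing sketch. The real benchmarks use the repository's benchmark harness and the MLAS packing entry points; the helper below, its name, and the pack-function signature are assumptions for illustration only.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical pack-function shape: copy/transform `bytes` of packed
// 2-bit weight data from src to dst.
using PackFn = void (*)(const uint8_t* src, uint8_t* dst, size_t bytes);

// Time an average packing call for an N x K weight matrix of 2-bit
// values (four weights per byte), the kind of measurement the new
// LUTGEMM_PACK benchmark performs over configurable N/K shapes.
double BenchPackSeconds(PackFn pack, size_t N, size_t K, int iterations) {
    const size_t bytes = N * K / 4;  // 2-bit weights: four per byte
    std::vector<uint8_t> src(bytes, 0x5A), dst(bytes, 0);
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        pack(src.data(), dst.data(), bytes);
    }
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count() / iterations;
}
```

Sweeping such a helper over `BlkLen`, `M`, `N`, `K`, thread counts, and the `HasZeroPoint` flag gives the parameter grid the new benchmarks expose.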
### Test Updates
- Relaxed the test constraint from `M < BlkLen || N < BlkLen` to
`N < BlkLen`, allowing `M=1` cases
- Added test cases for `M=1` configurations (`1x128x128`, `1x1024x1024`)
---------
Co-authored-by: Jambay Kinley <jambaykinley@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>