Improve Pre-Packing for 2-bit LUT Kernels (#27131)
## Description
This PR improves the pre-packing performance for SQNBitGemm LUT (Lookup
Table) GEMM operations by replacing scalar implementations with
AVX2-optimized kernels, and adds benchmarking infrastructure to measure
performance.
### AVX2 Optimized Weight Packing
- Added `PackQuantBData_avx2()` - an AVX2-optimized weight-packing routine
that performs bit-plane decomposition and multi-reshape/transpose
operations using SIMD instructions
- Added `PackScalesAndZeroPoints_avx2()` - AVX2-optimized packing of
scales and zero points, with template specialization for the
`HasZeroPoint` cases
- Registered new dispatch functions in `MlasLutGenKernelAvx2` dispatch
structure
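The core idea behind the bit-plane decomposition can be illustrated with a scalar reference sketch. This is not the AVX2 kernel from the PR; the function name and layout below are illustrative assumptions showing what the SIMD code vectorizes: each input byte holds four 2-bit weights, which are split into a low-bit plane and a high-bit plane, eight weights per output byte.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar reference for bit-plane decomposition of 2-bit
// weights. Each input byte packs four 2-bit values (LSB-first); the output
// separates bit 0 and bit 1 of every weight into two planes, one bit per
// weight, eight weights per plane byte. The AVX2 kernel in the PR performs
// the same transformation with SIMD instructions.
void DecomposeBitPlanes2Bit(const uint8_t* packed, size_t num_bytes,
                            std::vector<uint8_t>& lo_plane,
                            std::vector<uint8_t>& hi_plane) {
    const size_t num_weights = num_bytes * 4;  // 4 weights per packed byte
    lo_plane.assign((num_weights + 7) / 8, 0);
    hi_plane.assign((num_weights + 7) / 8, 0);
    for (size_t i = 0; i < num_weights; ++i) {
        // Extract the i-th 2-bit weight.
        const uint8_t v = (packed[i / 4] >> ((i % 4) * 2)) & 0x3;
        // Scatter its two bits into the corresponding planes.
        lo_plane[i / 8] |= static_cast<uint8_t>((v & 0x1) << (i % 8));
        hi_plane[i / 8] |= static_cast<uint8_t>(((v >> 1) & 0x1) << (i % 8));
    }
}
```

With the planes separated, a LUT-based kernel can index each plane independently rather than re-extracting 2-bit fields in the inner loop.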
### Refactored Dispatch Architecture
- Moved complex scalar packing logic from `qlutgemm.cpp` to
dispatch-based architecture
- Added new function signatures: `MLAS_QNBIT_LUT_PACK_QUANTB_DATA` and
`MLAS_QNBIT_LUT_PACK_SCALES_AND_ZP`
- Extended `MLAS_QNBIT_LUT_GEMM_DISPATCH` structure with
`PackQuantBData` and `PackScalesAndZeroPoints` function pointers
- Added thread pool support to `LutPackScalesAndZeroPoints()`
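The refactoring pattern can be sketched as follows. The exact MLAS parameter lists differ; the typedefs, stub bodies, and registration shown here are assumptions for illustration of how the dispatch structure routes packing calls to a per-ISA implementation instead of a hard-coded scalar path.

```cpp
#include <cstddef>

// Illustrative function-type declarations standing in for the new
// MLAS_QNBIT_LUT_PACK_QUANTB_DATA / MLAS_QNBIT_LUT_PACK_SCALES_AND_ZP
// signatures (parameter lists are assumed, not the real ones).
typedef void MLAS_QNBIT_LUT_PACK_QUANTB_DATA(
    size_t N, size_t K, size_t BlkLen,
    const void* QuantBData, void* PackedQuantBData);

typedef void MLAS_QNBIT_LUT_PACK_SCALES_AND_ZP(
    size_t N, size_t K, size_t BlkLen, bool HasZeroPoint,
    const float* Scales, const void* ZeroPoints, void* Packed);

// Dispatch structure extended with the two new function pointers; the
// caller invokes whichever implementation the platform registered.
struct MLAS_QNBIT_LUT_GEMM_DISPATCH {
    MLAS_QNBIT_LUT_PACK_QUANTB_DATA* PackQuantBData;
    MLAS_QNBIT_LUT_PACK_SCALES_AND_ZP* PackScalesAndZeroPoints;
};

// Stubs standing in for the AVX2 kernels (bodies elided).
void PackQuantBData_avx2(size_t, size_t, size_t, const void*, void*) {}
void PackScalesAndZeroPoints_avx2(size_t, size_t, size_t, bool,
                                  const float*, const void*, void*) {}

// Registration mirrors how MlasLutGenKernelAvx2 wires in its kernels.
const MLAS_QNBIT_LUT_GEMM_DISPATCH MlasLutGenKernelAvx2 = {
    PackQuantBData_avx2,
    PackScalesAndZeroPoints_avx2,
};
```

The benefit of this shape is that `qlutgemm.cpp` only needs a null check and an indirect call; adding a NEON or AVX-512 packing path later is a matter of registering another dispatch table.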
### Benchmarking
- Added `LUTGEMM_PACK` benchmark for measuring weight packing
performance
- Added `LUTGEMM_COMPUTE` benchmark for measuring GEMM compute
performance
- Configurable parameters: `BlkLen`, `M`, `N`, `K`, `Threads`,
`HasZeroPoint`
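As a shape illustration of what `LUTGEMM_PACK` measures, here is a minimal self-contained timing sketch. The real benchmarks use the repository's benchmark harness and the MLAS packing entry points; the helper below, its name, and the pack-function signature are assumptions for illustration only.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical pack-function shape: copy/transform `bytes` of packed
// 2-bit weight data from src to dst.
using PackFn = void (*)(const uint8_t* src, uint8_t* dst, size_t bytes);

// Time an average packing call for an N x K weight matrix of 2-bit
// values (four weights per byte), the kind of measurement the new
// LUTGEMM_PACK benchmark performs over configurable N/K shapes.
double BenchPackSeconds(PackFn pack, size_t N, size_t K, int iterations) {
    const size_t bytes = N * K / 4;  // 2-bit weights: four per byte
    std::vector<uint8_t> src(bytes, 0x5A), dst(bytes, 0);
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        pack(src.data(), dst.data(), bytes);
    }
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count() / iterations;
}
```

Sweeping such a helper over `BlkLen`, `M`, `N`, `K`, thread counts, and the `HasZeroPoint` flag gives the parameter grid the new benchmarks expose.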
### Test Updates
- Relaxed the test constraint from `M < BlkLen || N < BlkLen` to
`N < BlkLen`, allowing `M=1` cases
- Added test cases for `M=1` configurations (`1x128x128`, `1x1024x1024`)
---------
Co-authored-by: Jambay Kinley <jambaykinley@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>