onnxruntime
c1f38c03 - Add fp16 support for 8-bit MatMulNBits on ARM64 and fix pre-existing bugs (#27692)

### Description

This PR adds fp16 (half-precision) support for 8-bit MatMulNBits on ARM64 NEON and fixes several pre-existing bugs discovered during testing.

**New features:**

- **HQNBIT_CompFp16 for 8-bit:** Added `HQ8BitGemmPackQuantBData_CompFp16` and `HQ8BitBlkDequantBForHgemm_CompFp16` NEON kernels that pack and dequantize 8-bit quantized weights for fp16 GEMM. Reuses the existing `HQ4BitGemmKernel_CompFp16` for the actual compute, since the dequantized B matrix has the same layout (see the dequantize-then-GEMM sketch below).
- **HQNBIT_CompInt8 for 4-bit:** Added accuracy level 4 (int8 compute) support for fp16 4-bit MatMulNBits. Converts fp16 activations to fp32, then uses the existing SQ4Bit int8 kernels (see the widening sketch below).
- **HQNBIT_CompInt8 for 8-bit:** Added accuracy level 4 (int8 compute) support for fp16 8-bit MatMulNBits. Converts fp16 scales to fp32 for packing, then uses the existing SQ8Bit int8 kernels.

**Bug fixes:**

- **Bias offset bug in CompFp16 (Windows ARM multithreading):** Fixed a missing `+ RangeStartN` when initializing the `Bias` pointer in `HQ4BitGemm_CompFp16` and `HQ8BitGemm_CompFp16`. This caused incorrect results when using multiple threads: worker threads processing column ranges beyond the first read bias values from the wrong offset (see the bias-offset sketch below).
- **QuantBDataWorkspace not set for MLFloat16 fallback (macOS ARM crash):** Removed the `#ifdef MLAS_TARGET_AMD64_IX86` guard around setting `QuantBDataWorkspace` in `ComputeBPacked<MLFloat16>`, so macOS ARM (which uses the fp32 fallback path) correctly sets the workspace pointer for SQNBIT_CompInt8.
- **Scale/ZP packing skipped on non-x64 in MLFloat16 PrePack (macOS ARM gibberish):** Removed the `#ifdef MLAS_TARGET_AMD64_IX86` guard around the SQNBIT_CompInt8 scale and zero-point packing in the `MatMulNBits<MLFloat16>::PrePack` specialization, and added an `nbits_ == 8` condition to match the generic template's behavior on ARM (only 8-bit needs separate scale packing on ARM, while x64 needs it for both 4-bit and 8-bit; see the guard sketch below).

### Motivation and Context

8-bit quantized models with fp16 inputs are increasingly common on ARM devices (Windows ARM, macOS Apple Silicon), but the existing MatMulNBits implementation only supported 4-bit for the HQNBIT fp16 paths. This change extends support to 8-bit, enabling faster inference for 8-bit quantized models on ARM64 without requiring fp16→fp32 conversion of the weights.

The bug fixes address issues that were either pre-existing (the `#ifdef` guards were copy-paste inconsistencies from prior PRs) or introduced alongside the fp16 NEON support (the Bias offset issue). These caused crashes or incorrect output on macOS ARM and on multithreaded Windows ARM configurations.
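The sketches below illustrate the mechanics described above. First, why `HQ4BitGemmKernel_CompFp16` can be shared between bit widths: both the 4-bit and 8-bit paths first materialize a dequantized B tile in the same layout, so the GEMM kernel never sees the bit width. This is a minimal sketch using float in place of fp16 and a simplified per-block layout; the function names and signatures are illustrative stand-ins, not the actual MLAS kernels.

```cpp
// dequant_then_gemm_sketch.cpp: dequantize 8-bit block-quantized weights,
// then run a bit-width-agnostic compute step over the result.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Dequantize one K-length column of 8-bit block-quantized weights:
// B[k] = (quant[k] - zero_point[block]) * scale[block], block = k / BlkLen.
void Dequant8BitColumn(const uint8_t* QuantB, const float* Scales,
                       const uint8_t* ZeroPoints, float* B,
                       size_t K, size_t BlkLen) {
  for (size_t k = 0; k < K; ++k) {
    const size_t blk = k / BlkLen;
    B[k] = (int(QuantB[k]) - int(ZeroPoints[blk])) * Scales[blk];
  }
}

// Dot product over the already-dequantized column; this is the role the
// shared fp16 GEMM kernel plays once dequantization has produced B.
float DotOnDequantB(const float* A, const float* B, size_t K) {
  float acc = 0.f;
  for (size_t k = 0; k < K; ++k) acc += A[k] * B[k];
  return acc;
}

int main() {
  const size_t K = 4, BlkLen = 2;
  const uint8_t quant[K] = {130, 126, 10, 14};    // 8-bit weight codes
  const float scales[K / BlkLen] = {0.5f, 0.25f}; // per-block scales
  const uint8_t zps[K / BlkLen] = {128, 12};      // per-block zero points
  std::vector<float> B(K), A{1.f, 2.f, 3.f, 4.f};
  Dequant8BitColumn(quant, scales, zps, B.data(), K, BlkLen);
  std::printf("%g\n", DotOnDequantB(A.data(), B.data(), K));  // -0.5
}
```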
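Next, the widening step behind the CompInt8 paths: fp16 activations (and, for 8-bit, fp16 scales) are converted to fp32 so the existing SQ4Bit/SQ8Bit int8 kernels can be reused unchanged. `HalfToFloat` below is a portable scalar stand-in for the NEON conversion, and `WidenActivations` is a hypothetical helper sketching the hand-off, not the MLAS code.

```cpp
// fp16_widen_sketch.cpp: widen fp16 inputs to fp32 before int8 compute.
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

float HalfToFloat(uint16_t h) {
  uint32_t sign = uint32_t(h & 0x8000u) << 16;
  uint32_t exp = (h >> 10) & 0x1Fu;
  uint32_t man = h & 0x3FFu;
  uint32_t bits;
  if (exp == 0) {
    if (man == 0) {
      bits = sign;                                   // signed zero
    } else {                                         // subnormal: renormalize
      exp = 113;                                     // 127 - 15 + 1
      while (!(man & 0x400u)) { man <<= 1; --exp; }
      bits = sign | (exp << 23) | ((man & 0x3FFu) << 13);
    }
  } else if (exp == 31) {
    bits = sign | 0x7F800000u | (man << 13);         // inf / NaN
  } else {
    bits = sign | ((exp + 112) << 23) | (man << 13); // normal: rebias 15 -> 127
  }
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}

// Widen an fp16 activation row; the real path then quantizes the fp32 row
// to int8 per block and calls the existing fp32-activation int8 kernels.
std::vector<float> WidenActivations(const uint16_t* a_fp16, size_t n) {
  std::vector<float> a_fp32(n);
  for (size_t i = 0; i < n; ++i) a_fp32[i] = HalfToFloat(a_fp16[i]);
  return a_fp32;
}

int main() {
  const uint16_t a[3] = {0x3C00, 0x4200, 0xC000};  // 1.0, 3.0, -2.0 in fp16
  for (float v : WidenActivations(a, 3)) std::printf("%g ", v);
  std::printf("\n");  // prints: 1 3 -2
}
```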
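The bias-offset fix can be shown in miniature. When the N dimension is split across worker threads, each worker must offset the bias pointer by its starting column; without `+ RangeStartN`, every worker reads the first columns' bias values. This sketch uses float, a hypothetical simplified signature, and a single A row, not the actual MLAS worker.

```cpp
// bias_offset_sketch.cpp: why worker threads need Bias + RangeStartN.
#include <cstddef>
#include <cstdio>
#include <vector>

void GemmWorkerSketch(const float* A, const float* B, const float* Bias,
                      float* C, size_t K, size_t N,
                      size_t RangeStartN, size_t RangeCountN) {
  // This worker owns output columns [RangeStartN, RangeStartN + RangeCountN).
  // Buggy version:  const float* bias = Bias;   // every worker reads columns 0..RangeCountN
  const float* bias = Bias ? Bias + RangeStartN : nullptr;  // fix: offset by RangeStartN

  for (size_t n = 0; n < RangeCountN; ++n) {
    const size_t col = RangeStartN + n;
    float acc = bias ? bias[n] : 0.0f;
    for (size_t k = 0; k < K; ++k) acc += A[k] * B[k * N + col];
    C[col] = acc;
  }
}

int main() {
  const size_t K = 2, N = 4;
  std::vector<float> A{1.f, 1.f}, B(K * N, 1.f), Bias{10.f, 20.f, 30.f, 40.f}, C(N);
  // Simulate two worker threads, each handling half of the N dimension.
  GemmWorkerSketch(A.data(), B.data(), Bias.data(), C.data(), K, N, 0, 2);
  GemmWorkerSketch(A.data(), B.data(), Bias.data(), C.data(), K, N, 2, 2);
  for (float v : C) std::printf("%g ", v);  // 12 22 32 42 (bug: 12 22 12 22)
  std::printf("\n");
}
```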
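Finally, the guard fix in `MatMulNBits<MLFloat16>::PrePack`. Previously the whole SQNBIT_CompInt8 scale/zero-point packing block sat inside `#ifdef MLAS_TARGET_AMD64_IX86`, so ARM builds skipped it and later reads of the packed buffers produced the macOS ARM gibberish. The helper below is hypothetical and only sketches the resulting condition; the real code inlines it around the packing calls.

```cpp
// prepack_guard_sketch.cpp: which builds need separate scale packing.
#include <cstddef>
#include <cstdio>

bool NeedsSeparateScalePacking(size_t nbits) {
#if defined(MLAS_TARGET_AMD64_IX86)
  return nbits == 4 || nbits == 8;  // x64 packs scales for both bit widths
#else
  return nbits == 8;                // ARM: only 8-bit needs separate packing
#endif
}

int main() {
  std::printf("4-bit: %d, 8-bit: %d\n",
              int(NeedsSeparateScalePacking(4)), int(NeedsSeparateScalePacking(8)));
}
```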
### Improvements

Measured on `Snapdragon X Elite - X1E78100 - Qualcomm Oryon CPU`. Each cell shows the speedup factor, with absolute latency in parentheses.

#### Accuracy level 4 (uses HQNBIT_CompInt8) vs Accuracy level 1 (uses HQNBIT_CompFp16)

| Model | Seq 1 | Seq 256 | Seq 512 |
|-------|-------|---------|---------|
| **4-bit** | | | |
| Qwen 0.5B | 1.19× (9.6ms) | 1.36× (428ms) | 1.27× (1119ms) |
| Qwen 1.5B | 0.89× (39.8ms) | 1.62× (1371ms) | 1.54× (2694ms) |
| Qwen 3B | 1.16× (46.8ms) | 1.54× (2654ms) | 1.43× (5427ms) |
| **8-bit** | | | |
| Qwen 0.5B | 0.79× (22.5ms) | 2.59× (257ms) | 2.16× (642ms) |
| Qwen 1.5B | 1.14× (41.4ms) | 2.50× (848ms) | 2.55× (1636ms) |
| Qwen 3B | 1.07× (52.9ms) | 1.95× (2133ms) | 2.29× (3799ms) |

#### Latest changes vs ORT 1.24.3 (both accuracy level 4)

On ORT 1.24.3:

- 4-bit uses HQNBIT_CompFp16
- 8-bit uses a naive unpacked dequantize and matmul

| Model | Seq 1 | Seq 256 | Seq 512 |
|-------|-------|---------|---------|
| **4-bit** | | | |
| Qwen 0.5B | 1.13× (9.6ms) | 1.35× (428ms) | 1.27× (1119ms) |
| Qwen 1.5B | 0.82× (39.8ms) | 1.40× (1371ms) | 1.47× (2694ms) |
| Qwen 3B | 1.16× (46.8ms) | 1.47× (2654ms) | 1.51× (5427ms) |
| **8-bit** | | | |
| Qwen 0.5B | **35.4×** (22.5ms) | **5.0×** (257ms) | **3.2×** (642ms) |
| Qwen 1.5B | **98.0×** (41.4ms) | **6.8×** (848ms) | **4.7×** (1636ms) |
| Qwen 3B | **107.8×** (52.9ms) | **4.1×** (2133ms) | **3.1×** (3799ms) |