Add fp16 support for 8-bit MatMulNBits on ARM64 and fix pre-existing bugs (#27692)
### Description
This PR adds fp16 (half-precision) support for 8-bit MatMulNBits on
ARM64 NEON and fixes several pre-existing bugs discovered during
testing.
**New features:**
- **HQNBIT_CompFp16 for 8-bit:** Added
`HQ8BitGemmPackQuantBData_CompFp16` and
`HQ8BitBlkDequantBForHgemm_CompFp16` NEON kernels that pack and
dequantize 8-bit quantized weights for fp16 GEMM. Reuses the existing
`HQ4BitGemmKernel_CompFp16` for the actual compute since the dequantized
B matrix has the same layout.
- **HQNBIT_CompInt8 for 4-bit:** Added accuracy level 4 (int8 compute)
support for fp16 4-bit MatMulNBits. Converts fp16 activations to fp32,
then uses the existing SQ4Bit int8 kernels.
- **HQNBIT_CompInt8 for 8-bit:** Added accuracy level 4 (int8 compute)
support for fp16 8-bit MatMulNBits. Converts fp16 scales to fp32 for
packing, then uses the existing SQ8Bit int8 kernels.
**Bug fixes:**
- **Bias offset bug in CompFp16 (Windows ARM multithreading):** Fixed
missing `+ RangeStartN` when initializing `Bias` pointer in
`HQ4BitGemm_CompFp16` and `HQ8BitGemm_CompFp16`. This caused incorrect
results when using multiple threads, as worker threads processing column
ranges beyond the first would read bias values from the wrong offset.
- **QuantBDataWorkspace not set for MLFloat16 fallback (macOS ARM
crash):** Removed `#ifdef MLAS_TARGET_AMD64_IX86` guard around setting
`QuantBDataWorkspace` in `ComputeBPacked<MLFloat16>`, so macOS ARM
(which uses the fp32 fallback path) correctly sets the workspace pointer
for SQNBIT_CompInt8.
- **Scale/ZP packing skipped on non-x64 in MLFloat16 PrePack (macOS ARM
gibberish):** Removed `#ifdef MLAS_TARGET_AMD64_IX86` guard around the
SQNBIT_CompInt8 scale and zero-point packing in the
`MatMulNBits<MLFloat16>::PrePack` specialization. Added an `nbits_ == 8`
condition to match the generic template's behavior on ARM (only 8-bit
needs separate scale packing on ARM, while x64 needs it for both 4-bit
and 8-bit).
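The first bug fix above can be illustrated with a minimal sketch (hypothetical names mirroring the kernel's threading parameters; the real code lives in `HQ4BitGemm_CompFp16`/`HQ8BitGemm_CompFp16` and operates on fp16 data):

```cpp
#include <cassert>
#include <cstddef>

// Sketch of per-thread bias indexing. Each worker handles columns
// [RangeStartN, RangeStartN + RangeCountN). Before the fix, the worker
// initialized its bias pointer to `Bias` without the offset, so every
// thread read the bias values belonging to the first column range.
void AddBiasToTile(float* C, size_t ldc, const float* Bias,
                   size_t RangeStartN, size_t RangeCountN, size_t RangeCountM) {
  const float* bias = Bias ? Bias + RangeStartN : nullptr;  // the fix: + RangeStartN
  for (size_t m = 0; m < RangeCountM; ++m) {
    for (size_t n = 0; n < RangeCountN; ++n) {
      if (bias) C[m * ldc + RangeStartN + n] += bias[n];
    }
  }
}
```

With a single thread covering all columns (`RangeStartN == 0`) the bug is invisible, which is why it only surfaced under multithreaded runs.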
### Motivation and Context
8-bit quantized models with fp16 inputs are increasingly common on ARM
devices (Windows ARM, macOS Apple Silicon). The existing MatMulNBits
implementation only supported 4-bit for the HQNBIT fp16 paths. This
change extends support to 8-bit, enabling faster inference for 8-bit
quantized models on ARM64 without requiring fp16→fp32 conversion of the
weights.
The bug fixes address issues that were either pre-existing (the `#ifdef`
guards were copy-paste inconsistencies from prior PRs) or introduced
alongside the fp16 NEON support (the Bias offset issue). These caused
crashes or incorrect output on macOS ARM and multithreaded Windows ARM
configurations.
### Improvements
Measured on `Snapdragon X Elite - X1E78100 - Qualcomm Oryon CPU`. Each
cell shows the speedup over the baseline, with the absolute latency of
the new configuration in parentheses; Seq is the sequence length.
#### Accuracy level 4 (uses HQNBIT_CompInt8) vs Accuracy level 1 (uses
HQNBIT_CompFp16)
| Model | Seq 1 | Seq 256 | Seq 512 |
|-------|-------|---------|---------|
| **4-bit** | | | |
| Qwen 0.5B | 1.19× (9.6ms) | 1.36× (428ms) | 1.27× (1119ms) |
| Qwen 1.5B | 0.89× (39.8ms) | 1.62× (1371ms) | 1.54× (2694ms) |
| Qwen 3B | 1.16× (46.8ms) | 1.54× (2654ms) | 1.43× (5427ms) |
| **8-bit** | | | |
| Qwen 0.5B | 0.79× (22.5ms) | 2.59× (257ms) | 2.16× (642ms) |
| Qwen 1.5B | 1.14× (41.4ms) | 2.50× (848ms) | 2.55× (1636ms) |
| Qwen 3B | 1.07× (52.9ms) | 1.95× (2133ms) | 2.29× (3799ms) |
#### Latest changes vs ORT 1.24.3 (both accuracy level 4)
On ORT 1.24.3:
- 4-bit uses HQNBIT_CompFp16
- 8-bit uses a naive unpacked dequantize-then-matmul path
| Model | Seq 1 | Seq 256 | Seq 512 |
|-------|-------|---------|---------|
| **4-bit** | | | |
| Qwen 0.5B | 1.13× (9.6ms) | 1.35× (428ms) | 1.27× (1119ms) |
| Qwen 1.5B | 0.82× (39.8ms) | 1.40× (1371ms) | 1.47× (2694ms) |
| Qwen 3B | 1.16× (46.8ms) | 1.47× (2654ms) | 1.51× (5427ms) |
| **8-bit** | | | |
| Qwen 0.5B | **35.4×** (22.5ms) | **5.0×** (257ms) | **3.2×** (642ms) |
| Qwen 1.5B | **98.0×** (41.4ms) | **6.8×** (848ms) | **4.7×** (1636ms) |
| Qwen 3B | **107.8×** (52.9ms) | **4.1×** (2133ms) | **3.1×** (3799ms) |