Route fp16 HQNBIT_CompInt8 (4-bit and 8-bit) through fp32 MLAS path in MatMulNBits (#27820)
### Description
Routes fp16 `HQNBIT_CompInt8` through the fp32 MLAS path
(`SQNBIT_CompInt8`) at the operator level for both 4-bit and 8-bit
MatMulNBits, and removes ~370 lines of now-dead HQ CompInt8 wrapper
code from MLAS.
**Operator changes (matmul_nbits.cc):**
- PrePack: Uses `SQNBIT_CompInt8` for sizing/packing, pre-converts fp16
scales and bias to fp32, computes BZpCorr for asymmetric KleidiAI on
ARM64.
- ComputeBPacked: Bulk fp16→fp32 conversion of A, calls
`MlasQNBitGemmBatch<float>` with `SQNBIT_CompInt8`, bulk fp32→fp16
conversion of C.
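The new `ComputeBPacked` flow described above can be sketched as follows. This is a minimal, self-contained illustration, not the actual onnxruntime code: `HalfToFloat`/`FloatToHalf` are software stand-ins for MLAS's conversion helpers, and `Fp32GemmStub` (a toy elementwise doubling) stands in for the real `MlasQNBitGemmBatch<float>` call.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-in for MLAS's fp16->fp32 helper: software IEEE binary16 -> binary32
// conversion covering normal, subnormal, zero, and inf/NaN cases.
static float HalfToFloat(uint16_t h) {
    uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
    uint32_t exp = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0) {
        if (mant == 0) {
            bits = sign;  // signed zero
        } else {
            // subnormal half: renormalize into the float exponent range
            int e = -1;
            do { mant <<= 1; ++e; } while (!(mant & 0x400u));
            bits = sign | (static_cast<uint32_t>(127 - 15 - e) << 23) |
                   ((mant & 0x3FFu) << 13);
        }
    } else if (exp == 31) {
        bits = sign | 0x7F800000u | (mant << 13);  // inf / NaN
    } else {
        bits = sign | ((exp - 15 + 127) << 23) | (mant << 13);
    }
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Simplified fp32->fp16 stand-in (round-to-nearest-even; tiny values
// flushed to zero, overflow to inf) for the output conversion.
static uint16_t FloatToHalf(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    uint32_t sign = (bits >> 16) & 0x8000u;
    int32_t exp = static_cast<int32_t>((bits >> 23) & 0xFFu) - 127 + 15;
    uint32_t mant = bits & 0x7FFFFFu;
    if (exp <= 0) return static_cast<uint16_t>(sign);             // flush to zero
    if (exp >= 31) return static_cast<uint16_t>(sign | 0x7C00u);  // inf
    uint32_t m = mant >> 13;
    uint32_t rem = mant & 0x1FFFu;
    if (rem > 0x1000u || (rem == 0x1000u && (m & 1u))) {  // round to nearest-even
        if (++m == 0x400u) {
            m = 0;
            if (++exp >= 31) return static_cast<uint16_t>(sign | 0x7C00u);
        }
    }
    return static_cast<uint16_t>(sign | (static_cast<uint32_t>(exp) << 10) | m);
}

// Toy placeholder for MlasQNBitGemmBatch<float> on the SQNBIT_CompInt8 path.
static void Fp32GemmStub(const std::vector<float>& a, std::vector<float>& c) {
    for (size_t i = 0; i < a.size(); ++i) c[i] = a[i] * 2.0f;
}

// Shape of the routed ComputeBPacked: one bulk fp16->fp32 pass over A,
// one fp32 kernel call, one bulk fp32->fp16 pass over C.
static std::vector<uint16_t> ComputeBPackedSketch(const std::vector<uint16_t>& a_fp16) {
    std::vector<float> a_fp32(a_fp16.size());
    for (size_t i = 0; i < a_fp16.size(); ++i)
        a_fp32[i] = HalfToFloat(a_fp16[i]);         // bulk fp16 -> fp32
    std::vector<float> c_fp32(a_fp32.size());
    Fp32GemmStub(a_fp32, c_fp32);                   // fp32 MLAS path
    std::vector<uint16_t> c_fp16(c_fp32.size());
    for (size_t i = 0; i < c_fp32.size(); ++i)
        c_fp16[i] = FloatToHalf(c_fp32[i]);         // bulk fp32 -> fp16
    return c_fp16;
}
```

The key structural point is that both conversions happen exactly once over the full matrices, outside the GEMM, rather than inside each tile of the kernel.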
**MLAS cleanup (qnbitgemm.cpp, qnbitgemm_kernel_neon.cpp):**
- Removed `HQ4BitGemm_CompInt8`, `HQ8BitGemm_CompInt8`,
`HQ8BitCompInt8PerGemmWorkspace`, associated enum values, dispatch
branches, workspace entries, and `HQNBIT_CompInt8` NEON kernel
conditions.
- Added `HQNBIT_CompInt8` → `SQNBIT_CompInt8` redirect in
`MlasIsQNBitGemmAvailable` for `GetComputeType<MLFloat16>`
compatibility.
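The redirect in the last bullet can be pictured as a one-line compute-type remap. This is a hypothetical sketch: the enum mirrors the names used in this PR but is not the actual MLAS `QNBitGemmComputeType` definition, and `ResolveComputeType` is an illustrative helper, not the real `MlasIsQNBitGemmAvailable` implementation.

```cpp
// Illustrative mirror of the compute-type names referenced in this PR.
enum QNBitGemmComputeType {
    SQNBIT_CompFp32,
    SQNBIT_CompInt8,
    HQNBIT_CompFp16,
    HQNBIT_CompInt8,
};

// Sketch of the redirect: fp16 CompInt8 requests are now answered by the
// fp32 CompInt8 kernels, so availability queries for HQNBIT_CompInt8 are
// resolved against the SQ path instead.
static QNBitGemmComputeType ResolveComputeType(QNBitGemmComputeType t) {
    if (t == HQNBIT_CompInt8) {
        return SQNBIT_CompInt8;  // serve the fp16 request with fp32 kernels
    }
    return t;  // all other compute types are unchanged
}
```

This keeps callers that query via `GetComputeType<MLFloat16>` working even though the dedicated HQ CompInt8 kernels are gone.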
### Motivation and Context
The HQ CompInt8 kernels are wrappers that convert fp16→fp32 per-tile
before calling the same SQ fp32 kernels. This change:
1. **Eliminates per-tile overhead** via bulk conversion at the operator
level.
2. **Enables KleidiAI for fp16 4-bit** — previously bypassed by the
`HQNBIT_CompInt8` path.
3. **Removes ~370 lines of dead wrapper code** from MLAS.
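The per-tile vs. bulk distinction in point 1 can be shown schematically. Both loop shapes below are hypothetical stand-ins (plain `float` replaces fp16, `Widen` replaces the fp16→fp32 conversion, and a running sum replaces the GEMM kernel); neither is the actual MLAS code, but they compute the same result with different conversion placement.

```cpp
#include <cstddef>
#include <vector>

// Toy "conversion" standing in for an fp16 -> fp32 widening.
static float Widen(float h) { return h; }

// Old HQ-wrapper shape: convert each tile just before its fp32 kernel
// call, paying conversion setup once per tile but keeping tiles cache-hot.
static float PerTileSum(const std::vector<float>& a, size_t tile) {
    float sum = 0.0f;
    for (size_t t = 0; t < a.size(); t += tile) {
        size_t end = (t + tile < a.size()) ? t + tile : a.size();
        std::vector<float> buf(end - t);
        for (size_t i = t; i < end; ++i) buf[i - t] = Widen(a[i]);  // per-tile convert
        for (float v : buf) sum += v;                               // per-tile "kernel"
    }
    return sum;
}

// New operator-level shape: one full-matrix conversion pass, then one
// uninterrupted fp32 "kernel" pass.
static float BulkSum(const std::vector<float>& a) {
    std::vector<float> buf(a.size());
    for (size_t i = 0; i < a.size(); ++i) buf[i] = Widen(a[i]);  // bulk convert
    float sum = 0.0f;
    for (float v : buf) sum += v;                                // "kernel"
    return sum;
}
```

The results are identical; what changes is how many times conversion overhead is paid and how much data is re-touched between passes, which is the trade-off discussed in the NOTE below the benchmark tables.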
### Improvements
Measured on `Snapdragon X Elite - X1E78100 - Qualcomm Oryon CPU`
**Asymmetric:**
| Model | Seq Len | Acc1/Acc4 (before) | Acc1/Acc4 (after) | Acc4 speedup | Acc4 latency (after) |
|-------|---------|--------------------|-------------------|--------------|----------------------|
| Qwen 1.5B | 256 | 1.28× | 1.55× | **1.26×** | 1187.5ms |
| Qwen 1.5B | 512 | 1.14× | 1.63× | **1.55×** | 2257.2ms |
| Qwen 3B | 256 | 1.32× | 1.82× | **1.29×** | 2351.3ms |
| Qwen 3B | 512 | 1.38× | 1.70× | **1.28×** | 4777.2ms |
| Qwen 7B | 256 | 1.58× | 2.26× | **1.40×** | 4094.5ms |
| Qwen 7B | 512 | 1.49× | 2.23× | **1.52×** | 8002.6ms |
**Symmetric:**
| Model | Seq Len | Acc1/Acc4 (before) | Acc1/Acc4 (after) | Acc4 speedup | Acc4 latency (after) |
|-------|---------|--------------------|-------------------|--------------|----------------------|
| Qwen 1.5B | 256 | 0.95× | 1.45× | **1.67×** | 1255.5ms |
| Qwen 1.5B | 512 | 1.04× | 1.52× | **1.55×** | 2406.7ms |
| Qwen 3B | 256 | 1.39× | 1.88× | **1.32×** | 2215.0ms |
| Qwen 3B | 512 | 1.42× | 1.85× | **1.31×** | 4318.3ms |
| Qwen 7B | 256 | 1.66× | 2.58× | **1.55×** | 3564.4ms |
| Qwen 7B | 512 | 1.57× | 2.60× | **1.64×** | 7227.9ms |
**NOTE**: The 8-bit accuracy level 4 path shows some regression (5–25%
on 1.5B/3B models, neutral on 7B) due to the bulk fp16↔fp32 conversion
overhead replacing the old per-tile approach. The old HQ CompInt8
wrappers kept small tiles cache-hot, while the new unified path does
full-matrix conversion passes. This trade-off is acceptable since 4-bit
is the dominant quantization format (gaining 26–67%), 8-bit acc4 still
outperforms acc1 by 1.7–2.2×, and the regression is most pronounced at
smaller model sizes where absolute latencies are already low. A proper
fix would be 8-bit KleidiAI-style kernels rather than restoring the
wrapper code.