QMoE CPU Performance Update (Up to 4x on 4-bit) (#27364)
## Summary
This change improves QMoE CPU performance by moving more work to prepack
time and enabling the DirectQ4 GEMM fast path where appropriate, while
preserving an env-var switch for performance/accuracy A/B testing.
This PR introduces:
- Prepack and cache infrastructure for QMoE expert weights.
- DirectQ4 packed-B cache built during prepack (instead of mutable
runtime cache in `Compute()`).
- Fast-path support for block-wise cases (including block size 32 where
supported by MLAS Q4 type).
- Runtime toggle via `ORT_USE_MLAS_Q4_GEMM_MOE`.
- Default fast-path policy refined to avoid known accuracy-loss
scenarios unless explicitly overridden by env var.
- Test and benchmark refinements for QMoE CPU validation.
## Key Implementation Changes
### 1. Prepack-time cache build
- Moves DirectQ4 packed-B cache construction to prepack stage.
- Removes mutable runtime cache maintenance from `Compute()`.
- Reduces per-inference overhead and avoids mutable shared cache
complexity.
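
The design shift above can be illustrated with a minimal Python sketch. It is not the ORT C++ implementation: `QMoEKernelSketch` and `pack_b` are hypothetical names, and `pack_b` merely stands in for the real packing routine. The point is only that the cache is built once at prepack time and is read-only afterwards.

```python
from types import MappingProxyType


def pack_b(weights):
    """Hypothetical stand-in for the real packing routine: returns an immutable copy."""
    return tuple(weights)


class QMoEKernelSketch:
    """Illustrative only: models a kernel whose packed-B cache is built at prepack time."""

    def __init__(self):
        # Read-only view; stays empty until prepack() runs.
        self._packed_b = MappingProxyType({})

    def prepack(self, expert_weights):
        """Build the packed-B cache once, before any inference runs."""
        cache = {expert_id: pack_b(w) for expert_id, w in expert_weights.items()}
        self._packed_b = MappingProxyType(cache)

    def compute(self, expert_id):
        """Hot path: a plain lookup, with no cache mutation and no locking."""
        return self._packed_b[expert_id]
```

Because `compute()` never writes to the cache, concurrent inferences need no synchronization around it, which is the complexity the change removes.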
### 2. Fast path vs fallback
- Keeps two execution modes:
  - DirectQ4 GEMM fast path (`MlasQ4GemmPackB` + `DirectQ4Gemm` cache usage).
  - Fallback path (`DequantizePrePacked` + `MlasGemm`).
- Allows controlled fallback for accuracy-sensitive configurations.
### 3. Environment variable behavior
- `ORT_USE_MLAS_Q4_GEMM_MOE=1`: force fast path when supported.
- `ORT_USE_MLAS_Q4_GEMM_MOE=0`: force fallback path.
- Unset: use default policy that enables fast path unless a known
accuracy-loss pattern is detected.
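
The three cases above amount to a small decision function. The following Python sketch models that logic only; the actual check lives in the C++ kernel. The function name, the `supported` and `known_accuracy_risk` inputs, and the treatment of values other than `0` or `1` as "unset" are assumptions made for illustration.

```python
import os

ENV_VAR = "ORT_USE_MLAS_Q4_GEMM_MOE"


def use_fast_path(supported, known_accuracy_risk, environ=os.environ):
    """Return True if the DirectQ4 fast path would be chosen (illustrative model)."""
    if not supported:
        # Even env=1 only forces the fast path "when supported".
        return False
    value = environ.get(ENV_VAR)
    if value == "1":
        return True   # forced fast path
    if value == "0":
        return False  # forced fallback
    # Unset (assumption: any other value is treated the same way):
    # default policy avoids known accuracy-loss patterns.
    return not known_accuracy_risk
```

For example, `use_fast_path(True, True, {})` returns `False` under this model, while setting the variable to `1` overrides the default and returns `True`.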
### 4. Test updates
- QMoE CPU tests were refined to validate env-var on/off behavior and
no-env behavior.
- Coverage includes parity checks for symmetric/asymmetric,
row-wise/block-wise settings.
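
One way a Python test could exercise the on, off, and unset cases is a small context manager that sets the variable and restores the previous state afterwards. This is a sketch, not the helper used in the actual test suite; `q4_gemm_mode` is a hypothetical name, and it assumes the variable must be in place before the session or kernel that reads it is created.

```python
import os
from contextlib import contextmanager

ENV_VAR = "ORT_USE_MLAS_Q4_GEMM_MOE"


@contextmanager
def q4_gemm_mode(value):
    """Temporarily set ENV_VAR to "1" or "0", or unset it when value is None."""
    previous = os.environ.get(ENV_VAR)
    try:
        if value is None:
            os.environ.pop(ENV_VAR, None)
        else:
            os.environ[ENV_VAR] = value
        yield
    finally:
        # Restore whatever was there before, including "not set".
        if previous is None:
            os.environ.pop(ENV_VAR, None)
        else:
            os.environ[ENV_VAR] = previous
```

A test would then build and run the model inside `with q4_gemm_mode("1"):`, `with q4_gemm_mode("0"):`, and `with q4_gemm_mode(None):` and compare the outputs.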
## Benchmark Results (1000 inferences, `benchmark_qmoe.py`)
Note: PyTorch latency fluctuates across runs and is excluded from
conclusions below.
### ORT results comparison
| Config | Baseline ORT Time (ms) | Baseline ORT tok/s | New ORT Time (env=0) (ms) | New ORT tok/s (env=0) | New ORT Time (env=1) (ms) | New ORT tok/s (env=1) |
|---|---:|---:|---:|---:|---:|---:|
| Medium-4bit | 748.594 | 1.3 | 237.219 | 4.2 | 178.943 | 5.6 |
| Medium-8bit | 209.277 | 4.8 | 212.074 | 4.7 | 203.882 | 4.9 |
### ORT speedup vs baseline
| Config | env=0 speedup vs baseline (time) | env=1 speedup vs baseline (time) |
|---|---:|---:|
| Medium-4bit | 3.16x faster | 4.18x faster |
| Medium-8bit | 0.99x (about flat) | 1.03x faster |
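
The speedup figures are the baseline time divided by the new time. The short Python check below reproduces them from the timings in the results table; the variable names are illustrative.

```python
# Timings in milliseconds, copied from the ORT results table.
baseline_ms = {"Medium-4bit": 748.594, "Medium-8bit": 209.277}
new_ms = {
    "Medium-4bit": {"env=0": 237.219, "env=1": 178.943},
    "Medium-8bit": {"env=0": 212.074, "env=1": 203.882},
}


def speedup(baseline, new):
    """Time-based speedup: values above 1.0 mean the new build is faster."""
    return baseline / new


speedups = {
    config: {mode: round(speedup(baseline_ms[config], t), 2) for mode, t in modes.items()}
    for config, modes in new_ms.items()
}

for config, modes in speedups.items():
    print(config, modes)
```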
## Accuracy Notes
- `env=1` (forced fast path) provides the best 4-bit performance but may
show non-zero max diff in known cases.
- `env=0` (fallback) maintains parity, with zero observed max diff in the
reported benchmark runs.
- Default no-env policy is designed to avoid known accuracy-loss cases
while still enabling fast path where safe.
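
"Max diff" here is taken to mean the largest element-wise absolute difference between a reference output and the output under test. A minimal pure-Python sketch of that metric follows; the function name is illustrative, and the real benchmark may compute it differently (for example, on tensors rather than flat lists).

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)
```

Under this definition, a result of exactly `0.0` means the two paths produced identical outputs, while any non-zero result reflects a numerical difference between them.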