onnxruntime
b2a6e69e - QMoE CPU Performance Update (Up to 4x on 4-bit) (#27364)

Commit

59 days ago

QMoE CPU Performance Update (Up to 4x on 4-bit) (#27364)

## Summary

This change improves QMoE CPU performance by moving more work to prepack time and enabling the DirectQ4 GEMM fast path where appropriate, while preserving an env-var switch for performance/accuracy A/B testing.

This PR introduces:

- Prepack and cache infrastructure for QMoE expert weights.
- DirectQ4 packed-B cache built during prepack (instead of mutable runtime cache in `Compute()`).
- Fast-path support for block-wise cases (including block size 32 where supported by MLAS Q4 type).
- Runtime toggle via `ORT_USE_MLAS_Q4_GEMM_MOE`.
- Default fast-path policy refined to avoid known accuracy-loss scenarios unless explicitly overridden by env var.
- Test and benchmark refinements for QMoE CPU validation.

## Key Implementation Changes

### 1. Prepack-time cache build

- Moves DirectQ4 packed-B cache construction to prepack stage.
- Removes mutable runtime cache maintenance from `Compute()`.
- Reduces per-inference overhead and avoids mutable shared cache complexity.

### 2. Fast path vs fallback

- Keeps two execution modes:
  - DirectQ4 GEMM fast path (`MlasQ4GemmPackB` + `DirectQ4Gemm` cache usage).
  - Fallback path (`DequantizePrePacked` + `MlasGemm`).
- Allows controlled fallback for accuracy-sensitive configurations.

### 3. Environment variable behavior

- `ORT_USE_MLAS_Q4_GEMM_MOE=1`: force fast path when supported.
- `ORT_USE_MLAS_Q4_GEMM_MOE=0`: force fallback path.
- Unset: use default policy that enables fast path unless a known accuracy-loss pattern is detected.

### 4. Test updates

- QMoE CPU tests were refined to validate env-var on/off behavior and no-env behavior.
- Coverage includes parity checks for symmetric/asymmetric, row-wise/block-wise settings.

## Benchmark Results (1000 inferences, `benchmark_qmoe.py`)

Note: PyTorch latency fluctuates across runs and is excluded from conclusions below.
### ORT results comparison

| Config | Baseline ORT Time (ms) | Baseline ORT tok/s | New ORT Time (env=0) (ms) | New ORT tok/s (env=0) | New ORT Time (env=1) (ms) | New ORT tok/s (env=1) |
|---|---:|---:|---:|---:|---:|---:|
| Medium-4bit | 748.594 | 1.3 | 237.219 | 4.2 | 178.943 | 5.6 |
| Medium-8bit | 209.277 | 4.8 | 212.074 | 4.7 | 203.882 | 4.9 |

### ORT speedup vs baseline

| Config | env=0 speedup vs baseline (time) | env=1 speedup vs baseline (time) |
|---|---:|---:|
| Medium-4bit | 3.16x faster | 4.18x faster |
| Medium-8bit | 0.99x (about flat) | 1.03x faster |

## Accuracy Notes

- `env=1` (forced fast path) provides the best 4-bit performance but may show non-zero max diff in known cases.
- `env=0` (fallback) maintains parity behavior with zero observed max diff in the reported benchmark table.
- Default no-env policy is designed to avoid known accuracy-loss cases while still enabling fast path where safe.
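
The three-way behavior of `ORT_USE_MLAS_Q4_GEMM_MOE` described in this message can be illustrated with a small Python sketch. This is not the ORT implementation (which lives in the C++ CPU kernel); the function name `resolve_fast_path` and the flags `fast_path_supported` and `known_accuracy_risk` are hypothetical stand-ins, since the message does not say how the accuracy-loss pattern is detected. Treating any value other than `0` or `1` like an unset variable is also an assumption of this sketch.

```python
from typing import Mapping

ENV_VAR = "ORT_USE_MLAS_Q4_GEMM_MOE"


def resolve_fast_path(
    env: Mapping[str, str],
    fast_path_supported: bool,   # hypothetical: config is eligible for DirectQ4 GEMM
    known_accuracy_risk: bool,   # hypothetical: config matches a known accuracy-loss pattern
) -> bool:
    """Return True to take the DirectQ4 fast path, False to take the fallback path."""
    # The fast path is only ever used "when supported", even if forced on.
    if not fast_path_supported:
        return False

    value = env.get(ENV_VAR)
    if value == "1":
        return True    # forced fast path
    if value == "0":
        return False   # forced fallback (dequantize + regular GEMM)

    # Unset (or, by assumption here, unrecognized): default policy enables the
    # fast path unless a known accuracy-loss pattern applies.
    return not known_accuracy_risk
```

Passing `os.environ` as `env` would mirror an A/B run: the same configuration is evaluated once with the variable set to `0` and once with `1`, and the outputs and timings are compared.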

References

#27364 - QMoE CPU Performance Update (Up to 4x on 4-bit)

Author

tianleiwu

Parents

0f938536
