onnxruntime
dac5a143 - Add CPU QMoE 2-bit support and LUT GEMM fast path (#28185)

Commit
26 days ago
Add CPU QMoE 2-bit support and LUT GEMM fast path (#28185)

## Description

This PR adds `expert_weight_bits=2` support to the CPU QMoE operator and introduces a fast path for supported block-wise shapes using MLAS LUT GEMM. It also tightens CPU-side validation, expands test coverage for non-trivial 2-bit behavior, and adds implementation notes for the CPU QMoE kernel.

## Summary of Changes

### CPU QMoE Kernel

| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc` | Adds CPU 2-bit dequant support, 2-bit LUT GEMM eligibility checks, LUT prepack/cache support, and LUT execution for FC1/FC2 on supported block-wise shapes. Refactors the compute flow so the 2-bit LUT path is isolated while routing and accumulation remain shared. |
| `onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.h` | Adds CPU-side state for LUT prepacked buffers and shared compute inputs. |
| `onnxruntime/contrib_ops/cpu/moe/moe_helper.h` | Tightens shape validation, including `hidden_size % pack_size == 0` and inferred `inter_size` divisibility checks. |

### Schema and Documentation

| File | Change |
|------|--------|
| `onnxruntime/core/graph/contrib_ops/contrib_defs.cc` | Updates QMoE schema/docs to allow CPU-side 2-bit weights. |
| `docs/contrib_ops/cpu/qmoe.md` | Adds CPU QMoE implementation notes covering routing, quantization layouts, prepack behavior, LUT fast paths, fallbacks, and current limitations. |

### Tests

| File | Change |
|------|--------|
| `onnxruntime/test/contrib_ops/moe_test.cc` | Adds CPU 2-bit smoke, validation, non-zero functional, and LUT-eligible block-wise identity tests. |
| `onnxruntime/test/python/transformers/test_qmoe_cpu.py` | Extends Python-side QMoE parity coverage for 2-bit row-wise and block-wise packing paths. |

## Testing

- Built the provider object:
  - `ninja -C build/cu128/Release CMakeFiles/onnxruntime_providers.dir/home/tlwu/git/onnxruntime/onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc.o`
- Built the provider test object:
  - `ninja -C build/cu128/Release CMakeFiles/onnxruntime_provider_test.dir/home/tlwu/git/onnxruntime/onnxruntime/test/contrib_ops/moe_test.cc.o`
- Added CPU-side test coverage for:
  - 2-bit validation failures
  - non-trivial non-zero 2-bit outputs
  - LUT-eligible 2-bit block-wise identity behavior
- Full end-to-end provider gtest execution was not run from this checkout because the available top-level test binary does not expose the `MoETest` suite here.

## Motivation and Context

This work addresses CPU-provider support for QMoE 2-bit expert weights, matching the issue request for QMoE 2 bits on CPU. The PR also aligns the CPU implementation with how MLAS currently exposes optimized 2-bit execution: block-wise 2-bit shapes can use LUT GEMM, while unsupported shapes continue to use dequantize-plus-GEMM fallback paths.

## Checklist

- [x] Tests added/updated
- [x] Documentation updated
- [x] No breaking changes
- [ ] CI passes

---------

Co-authored-by: Copilot <copilot@github.com>
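
## Illustrative Sketch: 2-bit Packing and Block-wise Dequantization

To make the `hidden_size % pack_size == 0` check and the dequantize-plus-GEMM fallback more concrete, here is a minimal pure-Python sketch of 2-bit packing and block-wise dequantization. It is not taken from the PR: the low-bits-first order within each byte, the default zero point of 2, the one-scale-per-block layout, and all function names (`check_packed_dim`, `pack_row`, `unpack_row`, `dequantize_row`) are assumptions made for illustration and may differ from what `moe_quantization_cpu.cc` actually does.

```python
"""Sketch of 2-bit weight packing and block-wise dequantization (illustrative only)."""

BITS = 2
PACK_SIZE = 8 // BITS    # four 2-bit values fit in one byte
ASSUMED_ZERO_POINT = 2   # assumption: midpoint of the unsigned 2-bit range [0, 3]


def check_packed_dim(dim, name):
    """Reject dimensions that cannot be split into whole packed bytes."""
    if dim % PACK_SIZE != 0:
        raise ValueError(f"{name}={dim} must be divisible by pack_size={PACK_SIZE}")


def pack_row(values):
    """Pack 2-bit integers into bytes, lowest bits first (assumed bit order)."""
    check_packed_dim(len(values), "len(values)")
    packed = bytearray()
    for i in range(0, len(values), PACK_SIZE):
        byte = 0
        for j, q in enumerate(values[i:i + PACK_SIZE]):
            if not 0 <= q < (1 << BITS):
                raise ValueError(f"value {q} does not fit in {BITS} bits")
            byte |= q << (BITS * j)
        packed.append(byte)
    return bytes(packed)


def unpack_row(packed):
    """Inverse of pack_row: expand each byte into four 2-bit integers."""
    mask = (1 << BITS) - 1
    return [(byte >> (BITS * j)) & mask for byte in packed for j in range(PACK_SIZE)]


def dequantize_row(packed, scales, block_size, zero_point=ASSUMED_ZERO_POINT):
    """Block-wise dequantization: w[i] = (q[i] - zero_point) * scales[i // block_size]."""
    q = unpack_row(packed)
    if len(q) % block_size != 0 or len(scales) != len(q) // block_size:
        raise ValueError("expected exactly one scale per block of block_size values")
    return [(v - zero_point) * scales[i // block_size] for i, v in enumerate(q)]


if __name__ == "__main__":
    row = [0, 1, 2, 3, 3, 2, 1, 0]
    packed = pack_row(row)
    print(list(packed))
    print(dequantize_row(packed, scales=[0.5, 2.0], block_size=4))
```

In a fallback path of this kind, the dequantized row would then be fed to an ordinary floating-point GEMM. A LUT GEMM path avoids materializing the dequantized weights, which is why it depends on the packed layout and block shape being ones the backend supports.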