Add CPU QMoE 2-bit support and LUT GEMM fast path (#28185)
## Description
This PR adds `expert_weight_bits=2` support to the CPU QMoE operator and
introduces a fast path for supported block-wise shapes using MLAS LUT
GEMM. It also tightens CPU-side validation, expands test coverage for
non-trivial 2-bit behavior, and adds implementation notes for the CPU
QMoE kernel.
## Summary of Changes
### CPU QMoE Kernel
| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc` | Adds CPU 2-bit dequant support, 2-bit LUT GEMM eligibility checks, LUT prepack/cache support, and LUT execution for FC1/FC2 on supported block-wise shapes. Refactors the compute flow so the 2-bit LUT path is isolated while routing and accumulation remain shared. |
| `onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.h` | Adds CPU-side state for LUT prepacked buffers and shared compute inputs. |
| `onnxruntime/contrib_ops/cpu/moe/moe_helper.h` | Tightens shape validation, including `hidden_size % pack_size == 0` and inferred `inter_size` divisibility checks. |
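
As a rough illustration of the tightened shape checks, the sketch below mirrors the `hidden_size % pack_size == 0` rule in Python. The function name `validate_qmoe_shapes`, its parameters, the `pack_size = 8 // bits` relation, and the block-size divisibility rule for the inferred `inter_size` are assumptions made for illustration; the actual checks live in `moe_helper.h` and may differ.

```python
def validate_qmoe_shapes(hidden_size: int,
                         packed_inter_cols: int,
                         expert_weight_bits: int,
                         block_size: int = 0) -> int:
    """Hypothetical mirror of the CPU-side shape checks.

    Returns the inferred inter_size, or raises ValueError.
    """
    if expert_weight_bits not in (2, 4, 8):
        raise ValueError(f"unsupported expert_weight_bits: {expert_weight_bits}")

    # Assumption: quantized values are packed into bytes, so one byte
    # holds 8 // bits values (4 values per byte for 2-bit weights).
    pack_size = 8 // expert_weight_bits

    if hidden_size % pack_size != 0:
        raise ValueError(
            f"hidden_size ({hidden_size}) must be divisible by pack_size ({pack_size})")

    # Assumption: inter_size is inferred from the packed column count.
    inter_size = packed_inter_cols * pack_size

    # Assumption: for block-wise quantization, blocks must tile inter_size.
    if block_size > 0 and inter_size % block_size != 0:
        raise ValueError(
            f"inferred inter_size ({inter_size}) must be divisible by block_size ({block_size})")

    return inter_size
```

With 2-bit weights, `validate_qmoe_shapes(64, 32, 2, block_size=32)` infers an `inter_size` of 128, while a `hidden_size` of 66 is rejected because it is not a multiple of 4.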
### Schema and Documentation
| File | Change |
|------|--------|
| `onnxruntime/core/graph/contrib_ops/contrib_defs.cc` | Updates QMoE schema/docs to allow CPU-side 2-bit weights. |
| `docs/contrib_ops/cpu/qmoe.md` | Adds CPU QMoE implementation notes covering routing, quantization layouts, prepack behavior, LUT fast paths, fallbacks, and current limitations. |
### Tests
| File | Change |
|------|--------|
| `onnxruntime/test/contrib_ops/moe_test.cc` | Adds CPU 2-bit smoke, validation, non-zero functional, and LUT-eligible block-wise identity tests. |
| `onnxruntime/test/python/transformers/test_qmoe_cpu.py` | Extends Python-side QMoE parity coverage for 2-bit row-wise and block-wise packing paths. |
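
For readers unfamiliar with 2-bit packing, the sketch below shows one plausible layout: four 2-bit values per byte, lowest-order bits first. The helper names `pack_2bit` and `unpack_2bit` and the bit ordering are assumptions for illustration only, not the layout used by the kernel or by `test_qmoe_cpu.py`.

```python
def pack_2bit(values):
    """Pack a sequence of ints in [0, 3] into bytes, four values per byte.

    Assumed layout: element 0 occupies bits 0-1, element 1 bits 2-3, and so on.
    """
    if len(values) % 4 != 0:
        raise ValueError("length must be a multiple of 4 (the 2-bit pack size)")
    packed = bytearray()
    for i in range(0, len(values), 4):
        byte = 0
        for j, v in enumerate(values[i:i + 4]):
            if not 0 <= v <= 3:
                raise ValueError(f"value {v} does not fit in 2 bits")
            byte |= v << (2 * j)
        packed.append(byte)
    return bytes(packed)


def unpack_2bit(packed):
    """Inverse of pack_2bit: expand each byte into four 2-bit values."""
    return [(byte >> (2 * j)) & 0x3 for byte in packed for j in range(4)]
```

A round trip such as `unpack_2bit(pack_2bit([0, 1, 2, 3]))` returns the original list, which is the kind of property a packing parity test would typically assert.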
## Testing
- Built the provider object:
  - `ninja -C build/cu128/Release CMakeFiles/onnxruntime_providers.dir/home/tlwu/git/onnxruntime/onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc.o`
- Built the provider test object:
  - `ninja -C build/cu128/Release CMakeFiles/onnxruntime_provider_test.dir/home/tlwu/git/onnxruntime/onnxruntime/test/contrib_ops/moe_test.cc.o`
- Added CPU-side test coverage for:
  - 2-bit validation failures
  - non-trivial non-zero 2-bit outputs
  - LUT-eligible 2-bit block-wise identity behavior
- Full end-to-end provider gtest execution was not run from this
checkout because the available top-level test binary does not expose the
`MoETest` suite here.
## Motivation and Context
This work adds CPU-provider support for QMoE 2-bit expert weights, as
requested in the issue asking for 2-bit QMoE on CPU. The PR also aligns
the CPU implementation with how MLAS currently exposes optimized 2-bit
execution: block-wise 2-bit shapes can use LUT GEMM, while unsupported
shapes continue to use dequantize-plus-GEMM fallback paths.
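
The split between the LUT fast path and the fallback can be pictured with the Python reference sketch below. Everything in it is an assumption for illustration: the function names, the symmetric zero point of 2 for 2-bit values, the one-scale-per-block layout, and the `lut_supported` flag, which stands in for whatever eligibility check MLAS actually exposes.

```python
def choose_path(expert_weight_bits, block_size, lut_supported):
    """Hypothetical dispatch: LUT GEMM only for block-wise 2-bit shapes
    that the backend reports as supported; everything else falls back."""
    if expert_weight_bits == 2 and block_size > 0 and lut_supported:
        return "lut_gemm"
    return "dequant_gemm"


def dequantize_blockwise(q_row, scales, block_size, zero_point=2):
    """Dequantize one weight row; scales[i // block_size] covers element i.

    Assumption: symmetric 2-bit quantization with zero point 2.
    """
    return [(q - zero_point) * scales[i // block_size] for i, q in enumerate(q_row)]


def matvec_fallback(q_weights, scales, block_size, x):
    """Reference dequantize-plus-GEMM: dequantize each row, then dot with x."""
    out = []
    for q_row, s_row in zip(q_weights, scales):
        w = dequantize_blockwise(q_row, s_row, block_size)
        out.append(sum(wi * xi for wi, xi in zip(w, x)))
    return out
```

A reference like `matvec_fallback` is also a convenient oracle when checking an optimized path, since both paths should agree up to floating-point tolerance on shapes that either can handle.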
## Checklist
- [x] Tests added/updated
- [x] Documentation updated
- [x] No breaking changes
- [ ] CI passes
---------
Co-authored-by: Copilot <copilot@github.com>