Add support for QMoE in CPU (#25558)
This pull request adds CPU support for the quantized Mixture-of-Experts
(MoE) operator, `QMoE`, in ONNX Runtime. It adjusts the tensor type
constraints on the scale inputs, adds new kernel definitions, and
implements the `QMoE` operator for CPU execution, with shared validation
of input tensors and scales.
### Documentation Updates:
* Updated tensor type constraints for `fc1_scales`, `fc2_scales`, and
`fc3_scales` in `docs/ContribOperators.md` to use `T2` instead of `T`.
* Added descriptions for the new `QMoE` operator in
`docs/OperatorKernels.md`.
[[1]](diffhunk://#diff-a44f0272e7668a044f15119b6efb44d562b873a7bee23c6b753b2c47d7697135R565)
[[2]](diffhunk://#diff-a44f0272e7668a044f15119b6efb44d562b873a7bee23c6b753b2c47d7697135L960-R961)
### Operator Enhancements:
* Introduced a new `QMoE` operator for quantized Mixture-of-Experts in
CPU kernels (`onnxruntime/contrib_ops/cpu/cpu_contrib_kernels.cc`).
[[1]](diffhunk://#diff-fd949b2a9885f634c37c2048da9e35d227ed20adf1d7baf5de488f304a78bde9R109)
[[2]](diffhunk://#diff-fd949b2a9885f634c37c2048da9e35d227ed20adf1d7baf5de488f304a78bde9R275)
* Registered the `QMoE` operator in the kernel registry.
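
For readers unfamiliar with what a quantized MoE layer computes, the following is a minimal reference sketch for a single token, not the kernel added in this PR. The quantization scheme (uint8 weights with a fixed zero point of 128 and one scale per output row), the SiLU activation, the softmax top-k routing with renormalization, and the names `qmoe_token`, `fc1_q`, and `fc2_q` are all assumptions made for illustration.

```python
import math


def dequantize_rows(q_rows, scales, zero_point=128):
    """Map uint8 weights back to floats using one scale per output row (assumed scheme)."""
    return [[(q - zero_point) * s for q in row] for row, s in zip(q_rows, scales)]


def matvec(w, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def silu(v):
    """SiLU activation, chosen here only as an example activation type."""
    return [a / (1.0 + math.exp(-a)) for a in v]


def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    total = sum(e)
    return [a / total for a in e]


def qmoe_token(x, router_logits, experts, k=2):
    """Route one token to its top-k experts and mix their outputs.

    `experts` is a list of dicts with hypothetical keys
    fc1_q, fc1_scales, fc2_q, fc2_scales.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the selected experts
    out = [0.0] * len(experts[0]["fc2_q"])
    for i in top:
        e = experts[i]
        hidden = silu(matvec(dequantize_rows(e["fc1_q"], e["fc1_scales"]), x))
        y = matvec(dequantize_rows(e["fc2_q"], e["fc2_scales"]), hidden)
        out = [o + (probs[i] / norm) * yi for o, yi in zip(out, y)]
    return out
```

A production kernel would typically avoid materializing the dequantized weights and would batch tokens per expert; the sketch only shows where the `fc1_scales` and `fc2_scales` inputs enter the computation.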
### Codebase Additions:
* Added `MoEBaseCPU` class in
`onnxruntime/contrib_ops/cpu/moe/moe_base_cpu.h` to provide shared
functionality for MoE operations, including input validation and scale
checking.
* Implemented the `QMoE` operator in
`onnxruntime/contrib_ops/cpu/quantization/moe_quantization_cpu.h` with
support for quantized tensor types and activation types.
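
The kind of input validation and scale checking described above can be sketched as follows. This is an illustrative shape check only: the function name `check_moe_shapes` and the tensor layouts (unpacked weights of shape `(experts, inter, hidden)` and one scale per output row) are assumptions, and the actual checks in `moe_base_cpu.h` may differ.

```python
def check_moe_shapes(x, router, fc1_w, fc1_scales, fc2_w, fc2_scales):
    """Validate MoE input shapes (each argument is a shape tuple).

    Assumed layouts, for illustration only:
      x          (tokens, hidden)
      router     (tokens, experts)
      fc1_w      (experts, inter, hidden)
      fc1_scales (experts, inter)
      fc2_w      (experts, hidden, inter)
      fc2_scales (experts, hidden)
    Returns the inferred dimensions, or raises ValueError on a mismatch.
    """
    def require(cond, msg):
        if not cond:
            raise ValueError(msg)

    require(len(x) == 2, "input must be 2-D (tokens, hidden)")
    tokens, hidden = x

    require(len(router) == 2 and router[0] == tokens,
            "router must be 2-D with one row per token")
    experts = router[1]

    require(len(fc1_w) == 3 and fc1_w[0] == experts and fc1_w[2] == hidden,
            "fc1 weights must be (experts, inter, hidden)")
    inter = fc1_w[1]

    # Scales must line up with the output rows of the weights they rescale.
    require(tuple(fc1_scales) == (experts, inter),
            "fc1_scales must be (experts, inter)")
    require(tuple(fc2_w) == (experts, hidden, inter),
            "fc2 weights must be (experts, hidden, inter)")
    require(tuple(fc2_scales) == (experts, hidden),
            "fc2_scales must be (experts, hidden)")

    return {"tokens": tokens, "hidden": hidden, "experts": experts, "inter": inter}
```

Keeping such checks in a shared base class means every MoE kernel variant rejects mismatched scales the same way before any computation starts.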
### CUDA and Graph Updates:
* Updated type constraints for `T2` in the CUDA implementation of `QMoE`.
* Adjusted schema definitions for `fc1_scales` and `fc2_scales` to use
`T2` in `onnxruntime/core/graph/contrib_ops/contrib_defs.cc`.
[[1]](diffhunk://#diff-81f57d9adc2cce94f85a2949a895b7ff82efcc13d05e23ee6567661f0fecb7c0L1443-R1443)
[[2]](diffhunk://#diff-81f57d9adc2cce94f85a2949a895b7ff82efcc13d05e23ee6567661f0fecb7c0L1452-R1452)
Together, these changes let `QMoE` run on CPU and validate its input
tensors and scales.
---------
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>