Fix QMoE CPU Operator (#27360)

Commit

160 days ago

This PR addresses several issues in the QMoE CPU implementation and improves the MLAS documentation.

## Changes

### 1. QMoE CPU Operator Fixes

- **Corrected Bias Handling**: Renamed `fc2_bias_handled_by_q4_gemm` to `fc2_bias_added_by_mlas` and updated the logic to consistently track whether the FC2 bias has been applied. This ensures that the bias is neither double-counted nor missed when using `DirectQ4Gemm`.
- **SwiGLU Attribute Update**: Switched from `swiglu_interleaved` to `swiglu_fusion` in both the C++ operator and the Python test infrastructure to align with the latest QMoE implementation.

### 2. MLAS Documentation

- **Clarified Buffer Shapes**: Added explicit documentation to `MlasQ4GemmPackB` specifying that the input `FpData` buffer is expected to have shape `[K, N]`. This helps prevent layout-related errors in future integrations.

### 3. Test Updates

- **PyTorch Parity Fixes**: Refactored `onnxruntime/test/python/transformers/test_qmoe_cpu.py` to use `swiglu_fusion` and improved the test structure for better parity checks against PyTorch.

## Verification

- Ran `test_qmoe_cpu.py` and confirmed that all QMoE parity tests pass on CPU.
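The bias tracking and the `[K, N]` buffer layout described in the changes above can be illustrated with a small pure-Python sketch. It is not the operator's C++ code: apart from the flag name `fc2_bias_added_by_mlas`, every function and variable here (`gemm`, `fused_gemm_with_bias`, `fc2_forward`, `flat_index_kn`) is hypothetical, and the fused kernel is only a stand-in for a path such as `DirectQ4Gemm`.

```python
# Illustrative sketch only. All names except fc2_bias_added_by_mlas are
# hypothetical and do not come from the onnxruntime sources.

def gemm(a, b):
    """Plain matrix product of a [M, K] by b [K, N], both given as lists of rows."""
    K, N = len(b), len(b[0])
    return [[sum(row[k] * b[k][n] for k in range(K)) for n in range(N)] for row in a]


def fused_gemm_with_bias(a, b, bias):
    """Stand-in for a fused kernel that adds the bias inside the GEMM."""
    return [[v + bias[n] for n, v in enumerate(row)] for row in gemm(a, b)]


def fc2_forward(a, b, bias, use_fused_path):
    """Compute FC2 so that the bias is added exactly once on either path."""
    fc2_bias_added_by_mlas = False
    if use_fused_path:
        out = fused_gemm_with_bias(a, b, bias)
        fc2_bias_added_by_mlas = True  # the fused kernel already applied the bias
    else:
        out = gemm(a, b)
    if not fc2_bias_added_by_mlas:
        # Fallback path: the bias has not been applied yet, so add it here.
        out = [[v + bias[n] for n, v in enumerate(row)] for row in out]
    return out


def flat_index_kn(k, n, N):
    """Offset of element (k, n) in a row-major buffer of shape [K, N]."""
    return k * N + n


a = [[1.0, 2.0]]          # shape [M=1, K=2]
b = [[1.0, 0.0, 2.0],     # shape [K=2, N=3]
     [0.0, 1.0, 3.0]]
bias = [0.5, 0.5, 0.5]

fused = fc2_forward(a, b, bias, use_fused_path=True)
fallback = fc2_forward(a, b, bias, use_fused_path=False)
flat_b = [v for row in b for v in row]  # b flattened in [K, N] row-major order
```

Both paths return the same result because the flag records whether the bias was already applied. Reading `flat_b` with `flat_index_kn` recovers `b[k][n]` only when the buffer really is laid out as `[K, N]`; a buffer stored as `[N, K]` would be indexed incorrectly.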

References

#27360 - Fix QMoE CPU Operator

Author

tianleiwu

Parents

decd177d

onnxruntime 55f8234c - Fix QMoE CPU Operator (#27360)
