Fix QMoE CPU Operator (#27360)
This PR fixes several issues in the QMoE CPU implementation and
improves the MLAS documentation.
## Changes
### 1. QMoE CPU Operator Fixes
- **Corrected Bias Handling**: Renamed `fc2_bias_handled_by_q4_gemm` to
`fc2_bias_added_by_mlas` and updated the logic to consistently track
whether FC2 bias has been applied. This ensures that bias is not
double-counted or missed when using `DirectQ4Gemm`.
- **SwiGLU Attribute Update**: Switched from `swiglu_interleaved` to
`swiglu_fusion` in both the C++ operator and the Python test
infrastructure to align with the latest QMoE implementation standards.
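The bias-tracking idea in the first bullet can be illustrated with a small sketch. This is not the operator's C++ code: `gemm_with_optional_bias` and `fc2_forward` are hypothetical helpers, and only the flag name `fc2_bias_added_by_mlas` comes from this PR. The sketch assumes the GEMM path reports whether it already applied the bias, so the caller adds it exactly once.

```python
def matmul(a, b):
    """Plain [M, K] x [K, N] matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_bias(out, bias):
    """Add a length-N bias vector to every row of an [M, N] matrix."""
    return [[v + bv for v, bv in zip(row, bias)] for row in out]


def gemm_with_optional_bias(a, b, bias, fuse_bias):
    """Hypothetical stand-in for a GEMM path that may add the bias itself.

    Returns (output, bias_added) so the caller knows what was done.
    """
    out = matmul(a, b)
    if fuse_bias and bias is not None:
        return add_bias(out, bias), True
    return out, False


def fc2_forward(a, b, bias, fuse_bias):
    """Apply FC2 and make sure the bias is added exactly once."""
    out, fc2_bias_added_by_mlas = gemm_with_optional_bias(a, b, bias, fuse_bias)
    # Only add the bias here if the GEMM path did not already do so.
    if bias is not None and not fc2_bias_added_by_mlas:
        out = add_bias(out, bias)
    return out
```

Both values of `fuse_bias` should produce the same result; a mismatch would indicate a double-counted or missing bias.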
### 2. MLAS Documentation
- **Clarified Buffer Shapes**: Added explicit documentation to
`MlasQ4GemmPackB` to specify that the input `FpData` buffer expects a
shape of `[K, N]`. This helps prevent layout-related errors in future
integrations.
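To make the `[K, N]` expectation concrete, the following sketch shows how a flat buffer is indexed under that shape and how weights stored as `[N, K]` would need to be transposed first. It assumes row-major storage, which this description does not state explicitly, and the helper names are hypothetical; it does not call MLAS.

```python
def index_kn(k, n, num_cols):
    """Flat offset of element (k, n) in a row-major [K, N] buffer (N = num_cols)."""
    return k * num_cols + n


def transpose_nk_to_kn(flat_nk, n_dim, k_dim):
    """Convert a row-major [N, K] flat buffer into a row-major [K, N] flat buffer."""
    return [flat_nk[n * k_dim + k] for k in range(k_dim) for n in range(n_dim)]


# Example: K = 2, N = 3, with each value encoded as n * 10 + k.
weights_nk = [0, 1, 10, 11, 20, 21]          # laid out as [N, K]
weights_kn = transpose_nk_to_kn(weights_nk, n_dim=3, k_dim=2)
```

Passing an `[N, K]` buffer where `[K, N]` is expected would not fail loudly; the values would simply be read in the wrong order, which is the kind of layout error the added documentation is meant to prevent.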
### 3. Test Updates
- **PyTorch Parity Fixes**: Refactored
`onnxruntime/test/python/transformers/test_qmoe_cpu.py` to use
`swiglu_fusion` and improved the test structure for better parity checks
with PyTorch.
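As a rough illustration of what a parity check involves, the sketch below compares two flat output sequences against an absolute tolerance. The function names and the tolerance value are assumptions for illustration and are not taken from `test_qmoe_cpu.py`.

```python
def max_abs_diff(actual, expected):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(actual) != len(expected):
        raise ValueError("outputs must have the same length")
    return max((abs(a - e) for a, e in zip(actual, expected)), default=0.0)


def check_parity(actual, expected, atol):
    """Return True when every element differs by at most atol."""
    return max_abs_diff(actual, expected) <= atol
```

In a real test, `actual` would be the ONNX Runtime output and `expected` the PyTorch reference, with a tolerance chosen to reflect the quantization bit width.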
## Verification
- Ran `test_qmoe_cpu.py` and confirmed that all QMoE parity tests pass
on CPU.