Fix MLAS qgemm dispatch and kernel regressions in quantized conv tests (#27671)
## Description
This PR fixes longstanding MLAS issues that were causing
`NhwcTransformerTests.*` and `QDQTransformerTests.*` failures in
quantized convolution paths (see
https://github.com/microsoft/onnxruntime/issues/27670). The failures
were not in the graph transformers themselves; they came from incorrect
qgemm dispatch selection and broken backend kernel behavior in specific
AVX2-VNNI and AMX paths.
The fix removes incorrect `U8U8` dispatch upgrades, avoids a broken
AVX2-VNNI row-panel fallback, and corrects the AMX `U8S8` 32-row kernel
path. It also adds MLAS regression coverage for the conv-shaped qgemm
dimensions that exposed the problems.
## Summary of Changes
### Dispatch Selection Fixes
| File | Change |
|------|--------|
| `onnxruntime/core/mlas/lib/platform.cpp` | Remove three incorrect assignments that upgraded `GemmU8U8Dispatch` to `U8S8` dispatch objects in the AVXVNNI, AVX512VNNI, and AMX feature paths. |
### AVX2-VNNI Kernel Fix
| File | Change |
|------|--------|
| `onnxruntime/core/mlas/lib/qgemm_kernel_avx2.cpp` | Reduce `StrideM` from `6` to `4` for the `U8U8`, `S8S8`, and `S8U8` AVX2-VNNI qgemm dispatch objects so they never enter the legacy `>4` row fallback path. |
### AMX Kernel Fix
| File | Change |
|------|--------|
| `onnxruntime/core/mlas/lib/qgemm_kernel_amx.cpp` | Replace the broken pipelined `CountM >= 32` `U8S8` AMX fast path with the same per-K tile update pattern already used by the working smaller-row path. |
### Regression Coverage
| File | Change |
|------|--------|
| `onnxruntime/test/mlas/unittest/test_qgemm_fixture.h` | Add MLAS qgemm regression cases for conv-like shapes `6x30x207` and `169x30x207` in packed/non-packed and int32 or fp32 variants. |
## Root Cause
There were three separate MLAS correctness issues:
1. `platform.cpp` was incorrectly overwriting `GemmU8U8Dispatch` with
`U8S8` dispatch objects when newer CPU features were detected. That
caused `U8U8` conv workloads to run through the wrong dispatch path.
2. The AVX2-VNNI qgemm dispatch objects advertised an `M` stride of `6`,
but the assembly kernel only handled VNNI packing safely up to 4 rows.
For 5- or 6-row panels it fell back to an older AVX2 path with
incompatible packing and sign assumptions.
3. The AMX `U8S8` qgemm kernel had a bug in its `CountM >= 32` fast
path. The smaller-row AMX path was correct, but the 32-row pipelined
update logic produced wrong accumulators for conv-shaped workloads and
caused the remaining QDQ/NHWC failures on AMX-capable hosts.
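To make the second issue concrete, here is a simplified model of how an `M` stride decides the row-panel sizes a kernel sees. This is a sketch, not MLAS code: `SplitRowPanels`, `HitsLegacyFallback`, and `kMaxSafeVnniRows` are hypothetical names, and the model assumes the driver hands the kernel panels of at most `StrideM` rows each.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an MLAS function): split m rows into panels of at
// most stride_m rows, in the order a strided driver loop would produce them.
std::vector<std::size_t> SplitRowPanels(std::size_t m, std::size_t stride_m) {
    assert(stride_m > 0);
    std::vector<std::size_t> panels;
    for (std::size_t row = 0; row < m; row += stride_m) {
        panels.push_back(std::min(stride_m, m - row));
    }
    return panels;
}

// Assumption taken from the root-cause description: the VNNI path is only
// safe for panels of up to 4 rows; larger panels take the legacy fallback.
constexpr std::size_t kMaxSafeVnniRows = 4;

// True if any panel produced for (m, stride_m) would exceed the safe limit.
bool HitsLegacyFallback(std::size_t m, std::size_t stride_m) {
    const std::vector<std::size_t> panels = SplitRowPanels(m, stride_m);
    return std::any_of(panels.begin(), panels.end(),
                       [](std::size_t rows) { return rows > kMaxSafeVnniRows; });
}
```

Under this model, `M = 6` with a stride of `6` yields a single 6-row panel and reaches the fallback, while a stride of `4` yields panels of 4 and 2 rows and never does. The same holds for `M = 169`.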
## Why This Fix
- The `platform.cpp` cleanup restores the intended `U8U8` dispatch
selection on feature-rich x86 hosts.
- The AVX2-VNNI stride change is a targeted mitigation that avoids the
known-bad legacy fallback until that assembly path is corrected.
- The AMX kernel change keeps the AMX `U8S8` dispatch enabled, but
replaces the broken 32-row implementation with a proven update pattern
that matches the working smaller-row path.
- The new MLAS regression tests cover the exact conv-derived qgemm
shapes that exposed the bug, so future dispatch or kernel changes will
fail at the MLAS layer before surfacing as transformer test regressions.
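For context on why the dispatch selection matters, the sketch below shows a naive reference qgemm that reads the `B` bytes as either unsigned or signed. It is illustrative only and is not the MLAS test fixture: `ReferenceQGemm` and its parameters are hypothetical, and the row-major layouts are an assumption. The same raw bytes give different accumulators under the two interpretations, which is the kind of mismatch a `U8U8` workload routed through a `U8S8` path would produce.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical naive reference: C = (A - zero_a) * (B - zero_b) in int32.
// a is M x K row-major uint8; b is K x N row-major raw bytes.
// b_is_signed selects whether each B byte is read as int8 or as uint8.
std::vector<std::int32_t> ReferenceQGemm(
    const std::vector<std::uint8_t>& a, const std::vector<std::uint8_t>& b,
    std::size_t m, std::size_t n, std::size_t k,
    std::int32_t zero_a, std::int32_t zero_b, bool b_is_signed) {
    assert(a.size() == m * k);
    assert(b.size() == k * n);
    std::vector<std::int32_t> c(m * n, 0);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            std::int32_t acc = 0;
            for (std::size_t p = 0; p < k; ++p) {
                const std::int32_t av = a[i * k + p];
                const std::uint8_t raw = b[p * n + j];
                // Same byte, two possible meanings depending on the dispatch.
                const std::int32_t bv =
                    b_is_signed ? static_cast<std::int32_t>(static_cast<std::int8_t>(raw))
                                : static_cast<std::int32_t>(raw);
                acc += (av - zero_a) * (bv - zero_b);
            }
            c[i * n + j] = acc;
        }
    }
    return c;
}
```

With `A = [1, 2]`, `B = [200, 3]`, and zero points of `0`, the unsigned reading gives `206` and the signed reading gives `-50`, because byte `200` becomes `-56` as an int8. A regression case for a shape such as `6x30x207` can compare the kernel output against a reference of this form.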
## Testing
- `cd build/cuda/Release && ./onnxruntime_mlas_test --gtest_filter='QGemmU8S8_*169xN30xK207*:*QGemmU8S8_*6xN30xK207*'`
- `cd build/cuda/Release && ./onnxruntime_test_all --gtest_filter='NhwcTransformerTests.*:QDQTransformerTests.*'`
- Verified that the filtered transformer suite passes with AMX `U8S8`
dispatch enabled.
## Motivation and Context
These test failures had been present for a long time and were initially
attributed to transformer rewrites because they surfaced in NHWC and QDQ
test suites. Investigation showed that the optimized graphs were
structurally correct and that the failures came from lower-level MLAS
qgemm execution instead. Fixing the behavior in MLAS is the right layer
because it restores correctness for both direct qgemm coverage and
higher-level quantized conv paths.
## Checklist
- [x] Tests added/updated
- [x] No breaking changes
- [x] CI passes