onnxruntime
f3cc7fff - Fix MLAS qgemm dispatch and kernel regressions in quantized conv tests (#27671)

Commit
8 days ago
Fix MLAS qgemm dispatch and kernel regressions in quantized conv tests (#27671)

## Description

This PR fixes longstanding MLAS issues that were causing `NhwcTransformerTests.*` and `QDQTransformerTests.*` failures in quantized convolution paths (see https://github.com/microsoft/onnxruntime/issues/27670). The failures were not in the graph transformers themselves; they came from incorrect qgemm dispatch selection and broken backend kernel behavior in specific AVX2-VNNI and AMX paths.

The fix removes incorrect `U8U8` dispatch upgrades, avoids a broken AVX2-VNNI row-panel fallback, and corrects the AMX `U8S8` 32-row kernel path. It also adds MLAS regression coverage for the conv-shaped qgemm dimensions that exposed the problems.

## Summary of Changes

### Dispatch Selection Fixes

| File | Change |
|------|--------|
| `onnxruntime/core/mlas/lib/platform.cpp` | Remove three incorrect assignments that upgraded `GemmU8U8Dispatch` to `U8S8` dispatch objects in the AVXVNNI, AVX512VNNI, and AMX feature paths. |

### AVX2-VNNI Kernel Fix

| File | Change |
|------|--------|
| `onnxruntime/core/mlas/lib/qgemm_kernel_avx2.cpp` | Reduce `StrideM` from `6` to `4` for the `U8U8`, `S8S8`, and `S8U8` AVX2-VNNI qgemm dispatch objects so they never enter the legacy `>4` row fallback path. |

### AMX Kernel Fix

| File | Change |
|------|--------|
| `onnxruntime/core/mlas/lib/qgemm_kernel_amx.cpp` | Replace the broken pipelined `CountM >= 32` `U8S8` AMX fast path with the same per-K tile update pattern already used by the working smaller-row path. |

### Regression Coverage

| File | Change |
|------|--------|
| `onnxruntime/test/mlas/unittest/test_qgemm_fixture.h` | Add MLAS qgemm regression cases for conv-like shapes `6x30x207` and `169x30x207` in packed/non-packed and int32 or fp32 variants. |

## Root Cause

There were three separate MLAS correctness issues:

1. `platform.cpp` was incorrectly overwriting `GemmU8U8Dispatch` with `U8S8` dispatch objects when newer CPU features were detected. That caused `U8U8` conv workloads to run through the wrong dispatch path.
2. The AVX2-VNNI qgemm dispatch objects advertised an `M` stride of `6`, but the assembly kernel only handled VNNI packing safely up to 4 rows. For 5- or 6-row panels it fell back to an older AVX2 path with incompatible packing and sign assumptions.
3. The AMX `U8S8` qgemm kernel had a bug in its `CountM >= 32` fast path. The smaller-row AMX path was correct, but the 32-row pipelined update logic produced wrong accumulators for conv-shaped workloads and caused the remaining QDQ/NHWC failures on AMX-capable hosts.

## Why This Fix

- The `platform.cpp` cleanup restores the intended `U8U8` dispatch selection on feature-rich x86 hosts.
- The AVX2-VNNI stride change is a targeted mitigation that avoids the known-bad legacy fallback until that assembly path is corrected.
- The AMX kernel change keeps the AMX `U8S8` dispatch enabled, but replaces the broken 32-row implementation with a proven update pattern that matches the working smaller-row path.
- The new MLAS regression tests cover the exact conv-derived qgemm shapes that exposed the bug, so future dispatch or kernel changes will fail at the MLAS layer before surfacing as transformer test regressions.

## Testing

- `cd build/cuda/Release && ./onnxruntime_mlas_test --gtest_filter='QGemmU8S8_*169xN30xK207*:*QGemmU8S8_*6xN30xK207*'`
- `cd build/cuda/Release && ./onnxruntime_test_all --gtest_filter='NhwcTransformerTests.*:QDQTransformerTests.*'`
- Verified that the filtered transformer suite passes with AMX `U8S8` dispatch enabled.

## Motivation and Context

These test failures had been present for a long time and were initially attributed to transformer rewrites because they surfaced in NHWC and QDQ test suites. Investigation showed that the optimized graphs were structurally correct and that the failures came from lower-level MLAS qgemm execution instead. Fixing the behavior in MLAS is the right layer because it restores correctness for both direct qgemm coverage and higher-level quantized conv paths.

## Checklist

- [x] Tests added/updated
- [x] No breaking changes
- [x] CI passes
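
To make the AVX2-VNNI stride issue from the Root Cause section (item 2) concrete, the following is a simplified model of how an `M` stride interacts with the 4-row limit. It is a sketch, not MLAS code: `SplitRowsIntoPanels` and `HitsLegacyFallback` are hypothetical helpers, and the model assumes the driver hands the kernel at most `StrideM` rows per call, which may not match the real tiling logic in every detail.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an MLAS function): split m rows into consecutive
// panels of at most stride_m rows, as a tiled GEMM driver might do before
// invoking a kernel on each panel.
std::vector<std::size_t> SplitRowsIntoPanels(std::size_t m, std::size_t stride_m) {
    std::vector<std::size_t> panels;
    if (stride_m == 0) {
        return panels;  // Guard against an endless loop on a zero stride.
    }
    while (m > 0) {
        const std::size_t rows = std::min(m, stride_m);
        panels.push_back(rows);
        m -= rows;
    }
    return panels;
}

// Hypothetical helper: returns true if any panel is taller than the number of
// rows the VNNI path handles safely (4, per the description above), meaning
// the kernel would take the legacy fallback in this model.
bool HitsLegacyFallback(std::size_t m, std::size_t stride_m,
                        std::size_t max_safe_rows = 4) {
    const std::vector<std::size_t> panels = SplitRowsIntoPanels(m, stride_m);
    return std::any_of(panels.begin(), panels.end(),
                       [max_safe_rows](std::size_t rows) { return rows > max_safe_rows; });
}
```

Under this model, both regression shapes reach the fallback with a stride of 6: `M = 6` forms a single 6-row panel, and `M = 169` forms 28 six-row panels plus one 1-row panel. With a stride of 4, `M = 6` splits into panels of 4 and 2 rows, and no panel for either shape exceeds 4 rows.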