onnxruntime
09b5695c - Fix DQ→MatMulNBits fusion for FP16 models on CPU EP (#27640)

### Description

For FP16 models with block-quantized weights (`DQ(int4/int2/int8, fp16_scale) → MatMul(fp16)`), the `DQMatMulToMatMulNBitsSelector` failed to match on CPU EP: FP16 MatMul nodes are not claimed by CPU EP during graph partitioning, so their execution provider is left unassigned (empty string `""`), and the selector's EP compatibility check rejected such nodes.

This PR:

- Adds `""` (empty/unassigned EP) to the compatible-providers list for `DQMatMulToMatMulNBitsSelector` so it can match FP16 MatMul nodes not yet assigned to an EP. The resulting `MatMulNBits` node is assigned to `kCpuExecutionProvider` by the action (which has both `float` and `MLFloat16` CPU kernels).
- Adds `""` to the `QDQSelectorActionTransformer` transformer-level compatible EPs so unassigned nodes reach the individual selectors (other selectors are unaffected, since their own provider lists don't include `""`).
- Removes the `DQCastMatMulToMatMulNBitsSelector` and `DQCastMatMulToMatMulNBitsAction`, which handled a `DQ → Cast(fp16→fp32) → MatMul` pattern that only existed after `InsertCastTransformer` ran. That fusion worked only incidentally, when `FuseInitializersTransformer` (Level 4) triggered an optimization-loop repeat that gave Level 2 QDQ fusions a second pass — a behavior that did not occur in all builds (e.g., minimal/extended-minimal builds without `FuseInitializersTransformer`).
- Replaces the `DQCastMatMulConvertedToMatMulNBits` test with `DQMatMulFP16ConvertedToMatMulNBits`, which tests the actual scenario: `DQ(int4, fp16_scale) → MatMul(fp16)` on CPU EP.

### Motivation and Context

FP16 models with block-quantized weights were not getting the `DQ → MatMulNBits` fusion when running on CPU EP in certain ORT builds.
The fusion worked on x64 full builds by luck: `InsertCastTransformer` created `DQ→Cast→MatMul` patterns, then `FuseInitializersTransformer` (Level 4) modified FP16 initializers, causing the optimization loop to repeat and giving Level 2 QDQ fusions a second pass in which the Cast-aware selector matched. In builds without `FuseInitializersTransformer` (e.g., minimal builds, arm packages), the loop didn't repeat and the fusion never applied.

The root cause is that CPU EP has no FP16 MatMul kernel, so it doesn't claim FP16 MatMul nodes during partitioning. These nodes end up with an empty EP string, which both the `QDQSelectorActionTransformer` and the `BaseSelector` rejected. The fix lets the `DQMatMulToMatMulNBits` selector match unassigned nodes directly on the first Level 2 pass, before `InsertCastTransformer` runs, eliminating the dependency on the optimization-loop repeat.
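The EP-compatibility gate at the heart of the fix can be illustrated with a small pure-Python sketch. The names here (`compatible_eps_*`, `is_node_compatible`) are illustrative, not ORT's actual C++ identifiers; the real check lives in the QDQ selector/action machinery:

```python
# Sketch of the selector-level EP compatibility gate (hypothetical names).
# Before the fix, a node whose EP had not yet been assigned ("") was rejected;
# the fix adds "" to the compatible set so unassigned FP16 MatMul nodes match.

KCPU_EP = "CPUExecutionProvider"

compatible_eps_before = {KCPU_EP}
compatible_eps_after = {KCPU_EP, ""}  # "" = not yet assigned by partitioning


def is_node_compatible(node_ep: str, compatible_eps: set) -> bool:
    """Return True if a node with the given EP string passes the gate."""
    return node_ep in compatible_eps


# An FP16 MatMul node left unassigned by CPU EP partitioning:
unassigned_ep = ""
print(is_node_compatible(unassigned_ep, compatible_eps_before))  # False
print(is_node_compatible(unassigned_ep, compatible_eps_after))   # True
```

Nodes already assigned to `kCpuExecutionProvider` continue to match in both cases; only the previously rejected unassigned nodes change behavior, which is why other selectors (whose lists don't include `""`) are unaffected.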
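For context on the pattern being fused: block dequantization expands each quantized weight back to floating point via `w = (q - zero_point) * scale`, with one scale shared per block. A minimal pure-Python sketch of that per-block semantics (illustrative only; ORT's `MatMulNBits` kernel performs the equivalent internally on packed int4 data, so the weights stay quantized in memory):

```python
# Minimal sketch of block dequantization: each block of quantized weights
# shares one scale (and optional zero point). DQ expands the weights back to
# floating point, which MatMul then consumes; the DQ -> MatMul fusion into
# MatMulNBits avoids materializing the dequantized tensor.

def dequantize_blocks(qweights, scales, block_size, zero_point=0):
    """qweights: flat list of ints; scales: one float per block of block_size."""
    out = []
    for i, q in enumerate(qweights):
        scale = scales[i // block_size]
        out.append((q - zero_point) * scale)
    return out


# Example: two blocks of size 2 with different per-block scales.
w = dequantize_blocks([1, 2, 3, 4], scales=[0.5, 0.25], block_size=2)
print(w)  # [0.5, 1.0, 0.75, 1.0]
```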