Fix DQ→MatMulNBits fusion for FP16 models on CPU EP (#27640)
### Description
For FP16 models with block-quantized weights (`DQ(int4/int2/int8,
fp16_scale) → MatMul(fp16)`), the `DQMatMulToMatMulNBitsSelector` failed
to match on CPU EP because FP16 MatMul nodes are not claimed by CPU EP
during graph partitioning, leaving their execution provider unassigned
(empty string `""`). The selector's EP compatibility check rejected
these nodes.
This PR:
- Adds `""` (empty/unassigned EP) to the compatible providers list for
`DQMatMulToMatMulNBitsSelector` so it can match FP16 MatMul nodes not
yet assigned to an EP. The resulting `MatMulNBits` node is assigned to
`kCpuExecutionProvider` by the action (which has both `float` and
`MLFloat16` CPU kernels).
- Adds `""` to the `QDQSelectorActionTransformer` transformer-level
compatible EPs so unassigned nodes reach individual selectors (other
selectors are unaffected since their own provider lists don't include
`""`).
- Removes the `DQCastMatMulToMatMulNBitsSelector` and
`DQCastMatMulToMatMulNBitsAction`, which handled a `DQ → Cast(fp16→fp32)
→ MatMul` pattern that only existed after `InsertCastTransformer` ran.
That fusion only worked incidentally when `FuseInitializersTransformer`
(Level 4) triggered an optimization loop repeat, giving Level 2 QDQ
fusions a second pass — a behavior that didn't occur in all builds
(e.g., minimal/extended-minimal builds without
`FuseInitializersTransformer`).
- Replaces the `DQCastMatMulConvertedToMatMulNBits` test with
`DQMatMulFP16ConvertedToMatMulNBits` that tests the actual scenario:
`DQ(int4, fp16_scale) → MatMul(fp16)` on CPU EP.
### Motivation and Context
FP16 models with block-quantized weights were not getting `DQ →
MatMulNBits` fusion when running on CPU EP in certain ORT builds. The
fusion worked on x64 full builds by luck — `InsertCastTransformer`
created `DQ→Cast→MatMul` patterns, then `FuseInitializersTransformer`
(Level 4) modified FP16 initializers causing the optimization loop to
repeat, giving Level 2 QDQ fusions a second pass where the Cast-aware
selector matched. In builds without `FuseInitializersTransformer` (e.g.,
minimal builds, ARM packages), the loop didn't repeat and the fusion
never applied.
The root cause is that CPU EP has no FP16 MatMul kernel, so it doesn't
claim FP16 MatMul nodes during partitioning. These nodes have an empty
EP string, which the `QDQSelectorActionTransformer` and `BaseSelector`
both rejected. The fix allows the `DQMatMulToMatMulNBits` selector to
match unassigned nodes directly on the first Level 2 pass, before
`InsertCastTransformer` runs, eliminating the dependency on the
optimization loop repeat.