[Skills][Transformations] Introduce debug-matcher-pass skill (#35127)
### Details:
Introduces a new “debug-matcher-pass” skill that documents an end-to-end
workflow for diagnosing why an OpenVINO MatcherPass transformation
doesn’t fire, including a completed example diagnosis report to serve as
a quality reference.
### Usage Examples
#### Example 1:
Prompt:
> I ran the following command -- `./benchmark_app -d CPU -m
<path>/openvino_model.xml -niter 0` -- and expected the MoE3GeMMFusion
matcher pass to be applied, but it looks like the matcher pass fails.
Help me investigate why this transformation is not applied.
<details>
<summary>The skill's output (click to expand):</summary>
## MatcherPass Diagnosis: MoE3GeMMFusion
**Root cause:** All three `MatMul` nodes (gate, up, down projections) in
the model's MoE subgraph have `transpose_b=false`, but the
`MoE3GeMMFusion` pattern hard-requires `transpose_b=true` via an
`attrs_match` predicate. No explicit `Transpose` node wraps the weight
input either, so neither of the pattern's two matching paths can
succeed.
**Log evidence:**
```
{ MATCHING PATTERN NODE: WrapType<MatMul>(WrapType, any_input)
├─ AGAINST GRAPH NODE: MatMul(Multiply, Reshape)
├─ PREDICATE `attrs_match({ transpose_b: YES, transpose_a: NO }) && consumers_count(1)` FAILED
} NODES' TYPE MATCHED, but PREDICATE FAILED
```
This phrase appears **30 times** in the matcher log — once per MoE layer
candidate — confirming all 30 match attempts fail at exactly the same
point.
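Such counts can be double-checked directly against the captured log. A minimal sketch — the path `/tmp/matcher.log` is an assumption (it is the capture path used in Example 2 below), so pass the real path as the first argument:

```shell
#!/bin/sh
# Sketch: tally predicate failures in a captured matcher log.
# The path /tmp/matcher.log is an assumption; pass the real path as $1.
LOG="${1:-/tmp/matcher.log}"

# One summary line per failed match attempt (30 for this model).
grep -c "but PREDICATE FAILED" "$LOG"

# List each failing predicate line to confirm they all fail identically.
grep "PREDICATE .* FAILED" "$LOG" | sort | uniq -c
```

If every attempt fails at the same predicate, the final `uniq -c` collapses them into a single line whose count matches the tally above.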
**Failing node:** `ov::op::v0::MatMul` (down-projection, and
symmetrically gate/up-projection MatMuls). Op type is correct; only the
`transpose_b` attribute fails the predicate.
**Resolution:**
- File:
`src/plugins/intel_cpu/src/transformations/cpu_opset/common/pass/moe_matmuls_fusion.cpp`,
lines ~241, 248, 253
- Remove the `{"transpose_b", true}` constraint from all three
`wrap_type<MatMul>` predicate maps and instead read the actual flag
inside the callback via `gate_mm_node->get_transpose_b()`, then
propagate it to `BatchGatherMatmul`.
- Alternative: add a second pattern branch (via `pattern::op::Or`)
covering `transpose_b=false`.
## Reproducer Test
File:
`src/plugins/intel_cpu/tests/unit/transformations/moe_matmuls_fusion_test.cpp`
Test name:
`MoE3GeMMFusion_TransposeBFalse_TestF.MoE3GeMMFusion_TransposeBFalse_NotApplied`
Status before fix: **PASS (green)** — the transformation does not fire,
so the unchanged model equals the auto-cloned `model_ref`. The matcher
log from the test run shows the same `attrs_match({ transpose_b: YES })`
predicate failure as the original model, confirming the reproducer is
faithful.
</details>
#### Example 2:
Prompt:
> I need you to investigate why some transformations in the GPU plugin
pipeline don't work. I run the following command:
> `cd /home/guest/golubevv/openvino/bin/intel64/Release &&
./benchmark_app -d GPU -m subgraph.xml -niter 0`
> and see an issue which is most likely caused by transformation
pipeline problems. I know that the following transformations should
fire:
>
> 1. FuseVectorizedMOE3GEMM
> 2. ConvertMOEToMOECompressed
> 3. FuseMOE3GemmCompressed
> 4. KeepMOE3GemmConstPrecision
>
> But at least one of them didn't fire. Investigate the root cause.
<details>
<summary>The skill's output (click to expand):</summary>
# MatcherPass Diagnosis: `FuseMOE3GemmCompressed`
**Date:** 2026-04-03
**Model:** Trinity-Mini (`subgraph.xml`)
**Run command:** `./benchmark_app -d GPU -m subgraph.xml -compile_only`
---
## Summary of all four passes
| Pass | Result |
|---|---|
| `FuseVectorizedMOE3GEMM` | ✅ Fired (`CALLBACK SUCCEDED`) |
| `ConvertMOEToMOECompressed` | ✅ Fired (`CALLBACK SUCCEDED`) — produced
`MOECompressed` nodes |
| **`FuseMOE3GemmCompressed`** | ❌ **Never matched** — root cause |
| `KeepMOE3GemmConstPrecision` | ❌ Never matched — downstream effect: no
`MOE3GemmFusedCompressed` exists because step 3 didn't fire; triggers
crash `Input moecompressed: ... hasn't been found in primitive_ids map`
|
---
## Root cause
The sigmoid routing normalization branch in the model has an extra
`Multiply(Divide, Constant)` node between the normalization `Divide` and
the `Slice`.
The `FuseMOE3GemmCompressed` pattern's `sig_slice` (defined at line 91
of
[fuse_moe_3gemm_compressed.cpp](src/plugins/intel_gpu/src/plugin/transformations/fuse_moe_3gemm_compressed.cpp))
expects:
```
Slice( Divide(...), ... )
```
But the graph has:
```
Slice( Multiply( Divide(...), Constant ), ... )
```
The extra `Multiply` is a routing-weight scaling constant multiply
inserted after normalization. It is present in the Trinity-Mini model
but was not present in the models used when the pattern was originally
written.
---
## Log evidence
From both `/tmp/matcher.log` (original run) and the unit test
reproducer:
```
{ ARGUMENT 0: WrapType<Divide>
MATCHING PATTERN NODE: WrapType<Divide>(WrapType, WrapType)
AGAINST GRAPH NODE: Multiply(Divide, Constant)
} NODES' TYPE DIDN'T MATCH. EXPECTED: WrapType<Divide>. OBSERVED: Multiply
```
**Location in the match tree:**
`MOECompressed` → ARGUMENT 1 (`Unsqueeze`) → ARGUMENT 0 (`Reshape`) →
ARGUMENT 0 (`Transpose`) → ARGUMENT 0 (`ScatterElementsUpdate`) →
**ARGUMENT 2** (`patternOr(sm_norm_slice | sig_slice)`) → BRANCH 1
(`sig_slice`) → **ARGUMENT 0** (expected `WrapType<Divide>`, observed
`Multiply`).
**Failing graph node:** `Multiply(Divide, Constant)` — the
routing-weight scaling step inserted between `sig_norm` (Divide) and
`sig_slice` (Slice).
---
## Crash consequence
Because `FuseMOE3GemmCompressed` never fires,
`KeepMOE3GemmConstPrecision` finds no `MOE3GemmFusedCompressed` nodes
and also does nothing. The GPU program builder then fails:
```
Exception from src/plugins/intel_gpu/src/plugin/program_builder.cpp:265:
Input moecompressed:__module.model.layers.2.mlp/aten::sum/ReduceSum/MOECompressed
hasn't been found in primitive_ids map
```
---
## Resolution
In
[src/plugins/intel_gpu/src/plugin/transformations/fuse_moe_3gemm_compressed.cpp](src/plugins/intel_gpu/src/plugin/transformations/fuse_moe_3gemm_compressed.cpp)
around lines 90–91, add an `optional<Multiply>` wrapper between
`sig_norm` and `sig_slice`:
```cpp
// Before:
auto sig_norm = wrap_type<ov::op::v1::Divide>({sig_gather_el, sig_add_eps}, consumers_count(1));
auto sig_slice = wrap_type<ov::op::v8::Slice>({sig_norm, ANY, ANY, ANY, ANY}, consumers_count(1));
// After:
auto sig_norm = wrap_type<ov::op::v1::Divide>({sig_gather_el, sig_add_eps}, consumers_count(1));
auto sig_norm_scaled = optional<ov::op::v1::Multiply>({sig_norm, ANY}); // absorbs optional scaling
auto sig_slice = wrap_type<ov::op::v8::Slice>({sig_norm_scaled, ANY, ANY, ANY, ANY}, consumers_count(1));
```
`optional<Multiply>` will match `Multiply(Divide, Constant)` when
present, or fall through to `Divide` directly for models without the
extra scaling — no change to the existing passing test cases.
> **Note:** Once the pattern fires, also verify that
`MOE3GemmFusedCompressed`'s kernel/op implementation correctly applies
the routing-weight scaling factor. The `Multiply` constant is currently
not passed as an explicit argument to the fused op — the kernel may need
a new input or an internal constant to reproduce correct numerical
output.
---
## Reproducer Test
**File:**
[src/plugins/intel_gpu/tests/unit/transformations/fuse_moe_3gemm_compressed_test.cpp](src/plugins/intel_gpu/tests/unit/transformations/fuse_moe_3gemm_compressed_test.cpp)
**Test name:**
`TransformationTestsF.FuseMOE3GemmCompressed_SigmoidBias_ScaledNorm`
**Build target:** `ov_gpu_unit_tests`
**Run command:**
```bash
cd /home/guest/golubevv/openvino/bin/intel64/Release
OV_MATCHER_LOGGING=true OV_MATCHERS_TO_LOG=FuseMOE3GemmCompressed \
./ov_gpu_unit_tests \
--gtest_filter="*FuseMOE3GemmCompressed_SigmoidBias_ScaledNorm*"
```
**Status before fix:** PASS ✅ — the transformation does not fire, so the
model is unchanged and matches the auto-cloned `model_ref`. This
confirms the bug is reproduced.
The test log shows the identical failure phrase:
```
NODES' TYPE DIDN'T MATCH. EXPECTED: WrapType<Divide>. OBSERVED: Multiply
AGAINST GRAPH NODE: Slice(Multiply, Constant, ShapeOf, Constant, Constant)
```
**After fix:** the test will FAIL because `model` is now transformed and
no longer matches the auto-cloned ref. At that point, add an explicit
`model_ref` block with the expected `MOE3GemmFusedCompressed` result
graph to turn it into a proper regression guard.
</details>
### Tickets:
- *N/A*
### AI Assistance:
- *yes*
- *AI was used to improve the skill based on real usage examples*
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>