[CoreML EP] Add FusedConv support (#28289)
### Description
Adds support for \`com.microsoft:FusedConv\` to the CoreML EP's
MLProgram path (the NeuralNetwork path explicitly rejects it; see
Implementation). \`FusedConv\` is produced by ORT's
\`ConvActivationFusion\` pass when a model is optimized with the CPU EP
(or any EP in \`cpu_acl_js_webgpu_eps\`) and saved via
\`session.optimized_model_filepath\` or the ORT-format conversion tool.
Before this patch, the CoreML EP could not claim the resulting
\`com.microsoft:FusedConv\` nodes, so they fell back to the CPU EP and
fragmented the CoreML partition.
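Here is a deliberately simplified model of why unclaimed nodes fragment the partition. It treats the graph as a linear chain and counts maximal runs of ops the EP can claim. ORT's actual partitioner works on graph topology, and the op chain and the \`count_partitions\` helper below are hypothetical, chosen only for illustration.

```python
# Simplified, hypothetical model of EP partitioning on a linear chain of ops.
# ORT's real partitioner works on the graph topology; this only shows why one
# unclaimed op type splits an otherwise supported region.


def count_partitions(ops, supported):
    """Count maximal runs of consecutive ops that the EP can claim."""
    partitions = 0
    in_run = False
    for op in ops:
        if op in supported:
            if not in_run:
                partitions += 1  # a new EP partition starts here
            in_run = True
        else:
            in_run = False  # this op falls back to the CPU EP
    return partitions


# Toy residual-style chain: FusedConv blocks interleaved with claimable ops.
chain = ["FusedConv", "FusedConv", "Conv", "Add",
         "FusedConv", "FusedConv", "Conv", "Add", "GlobalAveragePool"]
without_patch = {"Conv", "Add", "GlobalAveragePool"}
with_patch = without_patch | {"FusedConv"}

print(count_partitions(chain, without_patch))  # 2
print(count_partitions(chain, with_patch))     # 1
```

Every \`FusedConv\` that the EP cannot claim acts as a wall between supported regions. Once the op type is claimable, the walls disappear and the chain collapses into one partition.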
ORT's in-process pipeline does not currently run
\`ConvActivationFusion\` when the CoreML EP is the target (the fusion's
compatible-EP set excludes CoreML), so \`FusedConv\` typically reaches
the CoreML EP only through pre-optimized graphs. That is a common
workflow: anyone who ships a pre-optimized model artifact (mobile
pipelines, ORT-format models, session-cached optimized graphs) and then
loads it with the CoreML EP hits this path.
There's no pre-existing issue tracking this; it was discovered via
DWPose / ResNet50 partitioning analysis on Apple Silicon.
### Empirical impact
ResNet50-v2 from the ONNX model zoo, CPU-optimized at
\`ORT_ENABLE_EXTENDED\` and reloaded on the CoreML EP (108 nodes total,
33 of them \`FusedConv\` with Relu activation). M3 Max, MLProgram, batch
1, 100-iter timed runs, 3 interleaved rounds (n=597 per variant):
| | Partitions | Nodes on CoreML | Mean (ms) | StdDev (ms) | P99 (ms) | Max (ms) |
|---|---|---|---|---|---|---|
| Without this patch | 18 | 75 / 108 | 23.34 | 1.01 | 27.68 | 30.59 |
| **With this patch** | **1** | **108 / 108** | **2.94** | **0.16** | **3.75** | **4.32** |
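**7.94× mean speedup** (23.34 ms → 2.94 ms). The 33 \`FusedConv\` nodes
that previously fell back to the CPU EP now run inside the single CoreML
partition (ANE/GPU). Run-to-run spread also tightens: the standard
deviation drops about 6× (1.01 ms → 0.16 ms).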
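The headline ratios can be recomputed directly from the ResNet50-v2 timing table. This short sketch does that. It also shows that the roughly 6× figure is a ratio of standard deviations, and the corresponding variance ratio is its square.

```python
# Values copied from the ResNet50-v2 timing table (milliseconds).
mean_without, mean_with = 23.34, 2.94
sd_without, sd_with = 1.01, 0.16

speedup = mean_without / mean_with  # ratio of mean latencies
sd_ratio = sd_without / sd_with     # ratio of standard deviations
var_ratio = sd_ratio ** 2           # variance scales with the square of stddev

print(f"mean speedup:   {speedup:.2f}x")    # mean speedup:   7.94x
print(f"stddev ratio:   {sd_ratio:.1f}x")   # stddev ratio:   6.3x
print(f"variance ratio: {var_ratio:.0f}x")  # variance ratio: 40x
```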
Partition counts on Conv-heavy ONNX-zoo models after CPU optimization:
| Model | Partitions without patch | Partitions with patch | Notes |
|---|---|---|---|
| ResNet50-v2 | 18 | **1** | 33 FusedConv (Relu) |
| FCN-ResNet50 | 18 | **1** | 35 FusedConv (Relu); fails to compile on CoreML for unrelated reasons |
| YOLOv3 (full) | 27 | **4** | 72 FusedConv (LeakyRelu); detection post-processing fails on CoreML for unrelated dynamic-shape reasons |
| YOLOv3-tiny | 13 | **7** | 11 FusedConv (LeakyRelu); same post-processing failure as YOLOv3 |
The partition reduction holds across all four architectures. Of these
ONNX-zoo models, only ResNet50 currently runs end-to-end on the CoreML
EP; the FCN and YOLO failures are orthogonal CoreML EP limitations in
segmentation upsampling and detection post-processing.
### Implementation
Reuses \`ConvOpBuilder\`, which now branches on \`op_type\`:
- \`Conv\`: behaviour unchanged.
- \`FusedConv\`: emit the \`conv\` MIL op into an intermediate, then
chain the activation MIL op on top. Supports all six activation types
\`ConvActivationFusion\` produces:
| ONNX activation | MIL op | Parameters |
|---|---|---|
| Relu | \`relu\` | – |
| Sigmoid | \`sigmoid\` | – |
| Tanh | \`tanh\` | – |
| LeakyRelu | \`leaky_relu\` | alpha (from \`activation_params\`) |
| Clip | \`clip\` | alpha=min, beta=max (from \`activation_params\`) |
| HardSigmoid | \`sigmoid_hard\` | alpha, beta (from \`activation_params\`) |
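The activation table above can be read as a data-driven dispatch. What follows is a minimal Python model of it, assuming \`activation_params\` holds the MIL parameters positionally in the order shown. The real builder is C++ inside \`ConvOpBuilder\`; \`ACTIVATION_TO_MIL\` and \`to_mil_activation\` are hypothetical names, and the parameter-count check is this sketch's own guard.

```python
# Hypothetical Python model of the FusedConv activation dispatch.
# The real logic is C++ in ConvOpBuilder; names here are illustrative only.

# ONNX activation -> (MIL op name, MIL parameter names filled from activation_params in order)
ACTIVATION_TO_MIL = {
    "Relu": ("relu", ()),
    "Sigmoid": ("sigmoid", ()),
    "Tanh": ("tanh", ()),
    "LeakyRelu": ("leaky_relu", ("alpha",)),
    "Clip": ("clip", ("alpha", "beta")),  # alpha = min, beta = max
    "HardSigmoid": ("sigmoid_hard", ("alpha", "beta")),
}


def to_mil_activation(activation, activation_params=()):
    """Return (mil_op, params) for a FusedConv activation, or raise on bad input."""
    if activation not in ACTIVATION_TO_MIL:
        raise ValueError(f"unsupported FusedConv activation: {activation!r}")
    mil_op, param_names = ACTIVATION_TO_MIL[activation]
    if len(activation_params) != len(param_names):
        raise ValueError(
            f"{activation} expects {len(param_names)} activation_params, "
            f"got {len(activation_params)}"
        )
    return mil_op, dict(zip(param_names, activation_params))


print(to_mil_activation("Relu"))              # ('relu', {})
print(to_mil_activation("Clip", [0.0, 6.0]))  # ('clip', {'alpha': 0.0, 'beta': 6.0})
print(to_mil_activation("LeakyRelu", [0.1]))  # ('leaky_relu', {'alpha': 0.1})
```

With a table-driven design, the six activations live in one place. The conv-then-activation chaining is the same for every entry, and the entries differ only in op name and parameter names.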
\`IsOpSupportedImpl\` rejects \`FusedConv\` in NeuralNetwork mode (which
would emit an unfused Conv and silently lose the activation) and rejects
any unrecognized activation string.
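The two rejection rules can be sketched as a small predicate. This assumes a boolean \`create_mlprogram\` that stands in for whatever option selects the MLProgram path; that name and \`is_fused_conv_supported\` are hypothetical, and the real \`IsOpSupportedImpl\` also performs the usual Conv checks, which are omitted here.

```python
# Hypothetical sketch of the FusedConv-specific checks in IsOpSupportedImpl.
# `create_mlprogram` is an assumed stand-in for the flag selecting MLProgram.
SUPPORTED_ACTIVATIONS = {"Relu", "Sigmoid", "Tanh", "LeakyRelu", "Clip", "HardSigmoid"}


def is_fused_conv_supported(create_mlprogram, activation):
    """Return True if the CoreML EP should claim a FusedConv node."""
    if not create_mlprogram:
        # NeuralNetwork would emit a plain Conv and silently drop the activation.
        return False
    # Any activation string outside the six handled ones is rejected.
    return activation in SUPPORTED_ACTIVATIONS


print(is_fused_conv_supported(True, "LeakyRelu"))   # True
print(is_fused_conv_supported(False, "LeakyRelu"))  # False
print(is_fused_conv_supported(True, "Gelu"))        # False
```

Rejection is the safe default. An unclaimed node still runs correctly on the CPU EP, whereas a claimed node with a dropped activation would produce wrong results.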
### Tests
Six new tests in
\`onnxruntime/test/providers/coreml/coreml_basic_test.cc\`, one per
supported activation, covering each parameter class (parameter-less,
single parameter, two positional parameters, two named parameters):
- \`FusedConvTestRelu\` — no \`activation_params\` attribute
- \`FusedConvTestSigmoid\` — same shape, exercises sigmoid op-name
dispatch
- \`FusedConvTestTanh\` — same shape, exercises tanh op-name dispatch
- \`FusedConvTestLeakyRelu\` — single param (alpha); the YOLOv3 case
- \`FusedConvTestClip\` — two params (min, max)
- \`FusedConvTestHardSigmoid\` — two params (alpha, beta); depends on
the HardSigmoid CoreML builder landed in #28182
Each verifies CoreML output against the CPU EP reference and asserts
\`ExpectedEPNodeAssignment::All\`. All pass locally on macOS 26.3 / M3
Max.
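As a reference for what each test compares against, \`FusedConv\` is simply the Conv output with the activation applied elementwise. The pure-Python sketch below models that under heavy simplifications: single channel, 1-D, no padding, stride 1. \`fused_conv1d\`, \`allclose\`, and the tolerances are hypothetical and are not the values the C++ tests use.

```python
import math

# Hypothetical single-channel 1-D reference: activation(conv(x) + bias).
# Real FusedConv is N-D with groups, pads, strides, and dilations.

ACTIVATIONS = {
    "Relu": lambda v, p: max(0.0, v),
    "Sigmoid": lambda v, p: 1.0 / (1.0 + math.exp(-v)),
    "Tanh": lambda v, p: math.tanh(v),
    "LeakyRelu": lambda v, p: v if v >= 0 else p[0] * v,
    "Clip": lambda v, p: min(max(v, p[0]), p[1]),  # p = (min, max)
    "HardSigmoid": lambda v, p: max(0.0, min(1.0, p[0] * v + p[1])),  # p = (alpha, beta)
}


def conv1d_valid(x, w, bias):
    """Cross-correlation (no kernel flip), no padding, stride 1, as ONNX Conv computes."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) + bias for i in range(len(x) - k + 1)]


def fused_conv1d(x, w, bias, activation, params=()):
    """Apply the activation elementwise to the Conv output."""
    act = ACTIVATIONS[activation]
    return [act(v, params) for v in conv1d_valid(x, w, bias)]


def allclose(a, b, rtol=1e-5, atol=1e-6):
    """Element-wise tolerance check, as when comparing an EP's output with a reference."""
    return len(a) == len(b) and all(
        math.isclose(u, v, rel_tol=rtol, abs_tol=atol) for u, v in zip(a, b)
    )


x = [1.0, -2.0, 3.0, -4.0, 5.0]
w = [0.5, -1.0, 0.25]
reference = fused_conv1d(x, w, 0.0, "Relu")
print(reference)                                    # [3.25, 0.0, 6.75]
print(fused_conv1d(x, w, 0.0, "Clip", (0.0, 1.0)))  # [1.0, 0.0, 1.0]
print(allclose(reference, [3.2500001, 0.0, 6.75]))  # True
```

A tolerance-based comparison is used because an accelerator backend may accumulate in a different order or precision than the CPU reference.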
Also adds the supported-ops doc entry.
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>