[CoreML EP] Support bool Cast in ML Program (#28595)
### Summary
Two changes to the ML Program `Cast` builder:
1. **Accept `BOOL` as a source and target dtype** in `HasSupportedInputsImpl`.
   The ML Program `cast` op already handles bool, and `AddToModelBuilderImpl`
   already maps `to == BOOL`; only the input/output type gate omitted it.
2. **Move the "no preceding node" check after the ML Program early-return.**
   That check is legacy gating for the NeuralNetwork ArgMax-only path (which
   dereferences `InputEdgesBegin()`); on the ML Program path a `Cast` fed
   directly by a graph input is fine, and rejecting it forced needless CPU
   fallback.
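
The effect of the two changes can be sketched in isolation. This is a minimal model, not the actual builder code: `DType`, `CastNodeInfo`, `IsSupportedCastType`, and `IsCastSupported` are hypothetical names, the set of non-bool dtypes in the gate is a placeholder, and the real implementation splits this logic between `HasSupportedInputsImpl` and `IsOpSupportedImpl` rather than using one function.

```cpp
#include <cassert>

// Hypothetical stand-ins for the real ONNX Runtime types (illustration only).
enum class DType { Float, Int32, Int64, Bool, String };

struct CastNodeInfo {
  DType input_type;
  DType output_type;         // value of the `to` attribute
  bool has_preceding_node;   // false when the Cast is fed directly by a graph input
  bool preceding_is_argmax;  // only meaningful when has_preceding_node is true
};

// Change 1: Bool is accepted by the type gate. The other accepted dtypes
// listed here are an assumption, not taken from the real builder.
bool IsSupportedCastType(DType t) {
  switch (t) {
    case DType::Float:
    case DType::Int32:
    case DType::Int64:
    case DType::Bool:
      return true;
    default:
      return false;
  }
}

// Change 2: the ML Program early-return comes first, so the
// "no preceding node" check only runs on the NeuralNetwork path.
bool IsCastSupported(const CastNodeInfo& node, bool create_mlprogram) {
  if (create_mlprogram) {
    return IsSupportedCastType(node.input_type) &&
           IsSupportedCastType(node.output_type);
  }
  // Legacy NeuralNetwork gating: Cast is only accepted directly after ArgMax.
  if (!node.has_preceding_node) {
    return false;
  }
  return node.preceding_is_argmax;
}
```

In this model an `int64 → bool` `Cast` fed directly by a graph input is accepted when `create_mlprogram` is true and rejected otherwise, which is the behaviour the reordering is meant to produce.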
### Why
This is the first of a **4-PR series** giving the CoreML EP the op coverage to
run transformer and diffusion graphs as a *single CoreML partition* instead of
fragmenting them into multiple partitions with CPU fallback in between.
Transformer attention-mask graphs are a `Cast → GatherND → And → Where` chain
over **bool** tensors. A CoreML partition cannot have a bool input/output
(CoreML `MLMultiArray` has no bool type), so bool must stay *internal*, which
makes `Cast` (the int↔bool boundary) the prerequisite for the rest of the
series.
### Combined impact of the series
With all four PRs plus #28278 (scalar-`Gather`), every model below goes from 2
CoreML partitions to **1, with zero graph breaks**, so the whole graph runs on
CoreML. Measured on an Apple M3 Max, ML Program format:
| Model (parameters) | Partitions (before → after) | CoreML speedup vs. CPU |
|-------|:---------------------------:|--------------:|
| BERT-large (340M) | 2 → 1 | 7.3× (fp32) / 11.0× (fp16) |
| ViT-large (304M) | 2 → 1 | 8.5× (fp32) / 10.3× (fp16) |
| GPT-2-large (774M) | 2 → 1 | 11.4× (fp16) |
| SD-1.5 UNet (860M) | 2 → 1 | 9.7× (fp16) |

Eliminating the graph breaks is a deterministic effect of the new op builders;
the speedups are what CoreML already delivers once a model is no longer
fragmented.
### Tests (`coreml_basic_test.cc`)
- `CastNonArgMaxNeuralNetworkNotSupported`: an `int64 → bool → float` cast
  chain falls back to CPU on the NeuralNetwork format, guarding the
  `IsOpSupportedImpl` reordering.

Positive `bool`-Cast coverage is in the dependent PRs: `Cast → GatherND → Cast`
(#28598's `GatherNDBoolData_MLProgram`) and `Cast → And → Cast` (#28597's
`And_MLProgram`). Both place a non-`Cast` op between the int↔bool casts and
check the result against the CPU EP. A *standalone*
`int64 → Cast(bool) → Cast(float)` round-trip can't be verified here because
CoreML's compiler fuses back-to-back `cast` ops and drops the bool clamp, so
the pattern needs that intervening op, which only the dependent PRs provide.
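
To make the "bool clamp" concrete, here is a small sketch of the arithmetic. It only models the value semantics of the two cast orders; `CastViaBool` and `CastFused` are hypothetical helper names, and `CastFused` is an assumed model of what a fused pair of casts computes, not CoreML's actual compiler behaviour.

```cpp
#include <cassert>
#include <cstdint>

// Two separate casts, int64 -> bool -> float: any nonzero integer
// becomes true, which then becomes 1.0f. This is the "clamp".
float CastViaBool(std::int64_t x) {
  const bool b = (x != 0);
  return b ? 1.0f : 0.0f;
}

// Assumed model of the fused result: a direct int64 -> float conversion
// that skips the intermediate bool, so the value is not clamped.
float CastFused(std::int64_t x) {
  return static_cast<float>(x);
}
```

For an input of `5`, `CastViaBool` gives `1.0f` while `CastFused` gives `5.0f`; for inputs of `0` or `1` the two agree. A round-trip test would therefore need inputs outside {0, 1} to detect the fusion at all, and with such inputs the standalone pattern disagrees with the CPU EP, which is why the tests rely on an intervening op.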
### Series — CoreML EP coverage for transformer / diffusion graphs
- **#28595: Support bool Cast in ML Program** *(this PR, prerequisite)*
- #28596: Add Sin and Cos unary ops *(independent)*
- #28597: Add Where and And builders *(depends on #28595)*
- #28598: Add GatherND builder *(depends on #28595)*

Together with #28278 (scalar-`Gather`), the series takes BERT / GPT-2 / ViT /
diffusion-UNet graphs, both tiny and full-size, from 2 CoreML partitions to 1,
with zero graph breaks.
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>