mlas/arm64: Add AArch64 assembly path for NCHWc float kernel and wire into build (#27788)
### Description
This change introduces a new AArch64 assembly implementation of the
NCHWc float convolution path while keeping the existing C++ entrypoint
stable. Build wiring ensures that existing call sites use the same
kernel entry transparently.
The optimization strategy is centered on three ideas:
1. Route the stable NCHWc C++ entrypoint to an AArch64 assembly
implementation.
2. Split execution into left/middle/right regions, with a 4-output
center tile and 3/2/1 remainder kernels in the middle region.
3. Duplicate hot loops for flags == 0 so the no-post-op path avoids
repeated accumulate/bias/ReLU flag checks inside the hottest loops.
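Ideas 2 and 3 can be illustrated with a small scalar C++ sketch. This is not the assembly from this PR: `SplitOutputRow`, `TileMiddle`, `StoreOutputs`, and the flag bit values are hypothetical names chosen for illustration. The sketch assumes that the left and right regions are the outputs whose kernel window overlaps padding, and it applies bias per element for simplicity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types and names for illustration only; not the MLAS API.
struct OutputRegions {
    std::size_t left;    // outputs whose window overlaps the left padding
    std::size_t middle;  // outputs whose window is fully inside the input
    std::size_t right;   // outputs whose window overlaps the right padding
};

// Split one output row into left/middle/right output counts.
OutputRegions SplitOutputRow(std::size_t input_width, std::size_t output_width,
                             std::size_t kernel_width, std::size_t stride,
                             std::size_t dilation, std::size_t pad_left) {
    assert(stride > 0 && kernel_width > 0 && dilation > 0);
    // Outputs x with x * stride < pad_left read from the left padding.
    const std::size_t left =
        std::min(output_width, (pad_left + stride - 1) / stride);
    // Input span covered by one kernel window.
    const std::size_t span = (kernel_width - 1) * dilation + 1;
    std::size_t middle = 0;
    if (input_width + pad_left >= span) {
        // Last output index whose window still ends inside the input.
        const std::size_t last_valid = (input_width + pad_left - span) / stride;
        const std::size_t middle_end = std::min(output_width, last_valid + 1);
        if (middle_end > left) middle = middle_end - left;
    }
    return {left, middle, output_width - left - middle};
}

// Cover the middle region with 4-output tiles plus one 3/2/1 remainder tile.
std::vector<std::size_t> TileMiddle(std::size_t middle) {
    std::vector<std::size_t> tiles(middle / 4, 4);
    if (middle % 4 != 0) tiles.push_back(middle % 4);
    return tiles;
}

// Assumed flag bits; the real kernel flag encoding may differ.
constexpr unsigned kAccumulate = 1u;
constexpr unsigned kBias = 2u;
constexpr unsigned kRelu = 4u;

// The flags == 0 case gets its own loop, so it never tests flags per element.
void StoreOutputs(float* out, const float* acc, const float* bias,
                  std::size_t n, unsigned flags) {
    if (flags == 0) {
        for (std::size_t i = 0; i < n; ++i) out[i] = acc[i];
        return;
    }
    for (std::size_t i = 0; i < n; ++i) {
        float v = acc[i];
        if (flags & kAccumulate) v += out[i];
        if (flags & kBias) v += bias[i];
        if (flags & kRelu) v = std::max(v, 0.0f);
        out[i] = v;
    }
}
```

Under these assumptions, a 192-wide row with a 3-wide kernel, stride 1, and left padding 1 splits into 1 left output, 190 middle outputs (47 tiles of 4 plus one tile of 2), and 1 right output.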
### Motivation and Context
This improves throughput while preserving existing behavior and
integration points, and addresses the comment
https://github.com/microsoft/onnxruntime/pull/27099#issuecomment-3792485400
from that PR (cc: @Rohanjames1997 and @hariharans29).
Per-convolution improvements, by core count, for the convolutions in the
model from
https://github.com/microsoft/onnxruntime/pull/25580#issuecomment-3321304864
when running on AWS Graviton 4 (all of these convolutions share
N = 1, KH = 3, KW = 3, DH = 1, DW = 1 and G = 1):
<table>
<thead>
<tr>
<th>Cores</th>
<th>IC</th><th>OC</th><th>IH</th><th>IW</th><th>OH</th><th>OW</th>
<th>SH</th><th>SW</th><th>PT</th><th>PL</th><th>PB</th><th>PR</th>
<th>P90 Improvement</th><th>P99 Improvement</th>
</tr>
</thead>
<tbody>
<tr><td
rowspan="5">1</td><td>32</td><td>32</td><td>192</td><td>192</td><td>192</td><td>192</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+47.57%</td><td>+43.64%</td></tr>
<tr><td>32</td><td>96</td><td>192</td><td>192</td><td>96</td><td>96</td><td>2</td><td>2</td><td>0</td><td>0</td><td>1</td><td>1</td><td>+47.56%</td><td>+44.69%</td></tr>
<tr><td>48</td><td>192</td><td>96</td><td>96</td><td>48</td><td>48</td><td>2</td><td>2</td><td>0</td><td>0</td><td>1</td><td>1</td><td>+47.25%</td><td>+45.12%</td></tr>
<tr><td>48</td><td>192</td><td>96</td><td>96</td><td>96</td><td>96</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+48.73%</td><td>+47.07%</td></tr>
<tr><td>64</td><td>256</td><td>48</td><td>48</td><td>48</td><td>48</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+47.73%</td><td>+47.76%</td></tr>
<tr><td
rowspan="5">2</td><td>32</td><td>32</td><td>192</td><td>192</td><td>192</td><td>192</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+47.72%</td><td>+42.78%</td></tr>
<tr><td>32</td><td>96</td><td>192</td><td>192</td><td>96</td><td>96</td><td>2</td><td>2</td><td>0</td><td>0</td><td>1</td><td>1</td><td>+48.41%</td><td>+47.76%</td></tr>
<tr><td>48</td><td>192</td><td>96</td><td>96</td><td>48</td><td>48</td><td>2</td><td>2</td><td>0</td><td>0</td><td>1</td><td>1</td><td>+47.13%</td><td>+44.62%</td></tr>
<tr><td>48</td><td>192</td><td>96</td><td>96</td><td>96</td><td>96</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+48.58%</td><td>+46.65%</td></tr>
<tr><td>64</td><td>256</td><td>48</td><td>48</td><td>48</td><td>48</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+47.59%</td><td>+47.32%</td></tr>
<tr><td
rowspan="5">4</td><td>32</td><td>32</td><td>192</td><td>192</td><td>192</td><td>192</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+45.18%</td><td>+38.51%</td></tr>
<tr><td>32</td><td>96</td><td>192</td><td>192</td><td>96</td><td>96</td><td>2</td><td>2</td><td>0</td><td>0</td><td>1</td><td>1</td><td>+46.34%</td><td>+45.25%</td></tr>
<tr><td>48</td><td>192</td><td>96</td><td>96</td><td>48</td><td>48</td><td>2</td><td>2</td><td>0</td><td>0</td><td>1</td><td>1</td><td>+46.17%</td><td>+42.45%</td></tr>
<tr><td>48</td><td>192</td><td>96</td><td>96</td><td>96</td><td>96</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+48.20%</td><td>+45.25%</td></tr>
<tr><td>64</td><td>256</td><td>48</td><td>48</td><td>48</td><td>48</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+47.32%</td><td>+46.67%</td></tr>
<tr><td
rowspan="5">8</td><td>32</td><td>32</td><td>192</td><td>192</td><td>192</td><td>192</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+40.07%</td><td>+25.57%</td></tr>
<tr><td>32</td><td>96</td><td>192</td><td>192</td><td>96</td><td>96</td><td>2</td><td>2</td><td>0</td><td>0</td><td>1</td><td>1</td><td>+42.93%</td><td>+45.06%</td></tr>
<tr><td>48</td><td>192</td><td>96</td><td>96</td><td>48</td><td>48</td><td>2</td><td>2</td><td>0</td><td>0</td><td>1</td><td>1</td><td>+44.50%</td><td>+38.86%</td></tr>
<tr><td>48</td><td>192</td><td>96</td><td>96</td><td>96</td><td>96</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+47.31%</td><td>+45.22%</td></tr>
<tr><td>64</td><td>256</td><td>48</td><td>48</td><td>48</td><td>48</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+47.02%</td><td>+46.42%</td></tr>
<tr><td
rowspan="5">16</td><td>32</td><td>32</td><td>192</td><td>192</td><td>192</td><td>192</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+34.06%</td><td>+28.48%</td></tr>
<tr><td>32</td><td>96</td><td>192</td><td>192</td><td>96</td><td>96</td><td>2</td><td>2</td><td>0</td><td>0</td><td>1</td><td>1</td><td>+38.12%</td><td>+39.83%</td></tr>
<tr><td>48</td><td>192</td><td>96</td><td>96</td><td>48</td><td>48</td><td>2</td><td>2</td><td>0</td><td>0</td><td>1</td><td>1</td><td>+42.17%</td><td>+35.48%</td></tr>
<tr><td>48</td><td>192</td><td>96</td><td>96</td><td>96</td><td>96</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+45.25%</td><td>+42.73%</td></tr>
<tr><td>64</td><td>256</td><td>48</td><td>48</td><td>48</td><td>48</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+46.44%</td><td>+45.84%</td></tr>
</tbody>
</table>
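The OH and OW columns in the table above are consistent with the standard convolution output-size formula. The following sketch shows that arithmetic. `ConvOutputSize` is a hypothetical helper written for illustration, not a function from MLAS.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of MLAS): output extent along one axis,
// floor((input + pad_begin + pad_end - dilation * (kernel - 1) - 1) / stride) + 1.
std::size_t ConvOutputSize(std::size_t input, std::size_t kernel,
                           std::size_t stride, std::size_t dilation,
                           std::size_t pad_begin, std::size_t pad_end) {
    assert(stride > 0 && kernel > 0 && dilation > 0);
    const std::size_t span = dilation * (kernel - 1) + 1;  // window extent
    const std::size_t padded = input + pad_begin + pad_end;
    if (padded < span) return 0;  // the window does not fit at all
    return (padded - span) / stride + 1;
}
```

For example, the stride-2 rows with IH = 192, PT = 0, and PB = 1 give `ConvOutputSize(192, 3, 2, 1, 0, 1) == 96`, matching the OH column.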
End-to-end performance improvement when running the model at different
core counts:
| Model | Cores | P90 Improvement | P99 Improvement |
|---|---:|---:|---:|
| shareable_model.onnx | 1 | +24.35% | +24.17% |
| shareable_model.onnx | 2 | +24.14% | +23.82% |
| shareable_model.onnx | 4 | +23.40% | +23.22% |
| shareable_model.onnx | 8 | +22.72% | +22.55% |
| shareable_model.onnx | 16 | +20.19% | +19.89% |
Running the `bench_sconv_nchwc.cpp` benchmark on an AWS Graviton 4
`c8g.16xlarge` instance with this PR gives:
```
$ ./build/Linux/Release/onnxruntime_mlas_benchmark --benchmark_filter='SCONV_NCHWC_DIRECT/DirectNchwcCases'
The number of inputs is very large. QNBITGEMM<float, 4> will be repeated at least 128 times.
The number of inputs is very large. QNBITGEMM<float, 8> will be repeated at least 128 times.
2026-03-23T17:35:58+00:00
Running ./build/Linux/Release/onnxruntime_mlas_benchmark
Run on (64 X 2000 MHz CPU s)
CPU Caches:
L1 Data 64 KiB (x64)
L1 Instruction 64 KiB (x64)
L2 Unified 2048 KiB (x64)
L3 Unified 36864 KiB (x1)
Load Average: 0.01, 0.13, 0.36
--------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:32/OC:32/IH:192/IW:192/KH:3/KW:3/PT:1/PL:1/PB:1/PR:1/S:1/D:1/real_time 9169970 ns 9170100 ns 76
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:32/OC:96/IH:192/IW:192/KH:3/KW:3/PT:0/PL:0/PB:1/PR:1/S:2/D:1/real_time 7048829 ns 7047903 ns 99
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:48/OC:192/IH:96/IW:96/KH:3/KW:3/PT:0/PL:0/PB:1/PR:1/S:2/D:1/real_time 5422228 ns 5421912 ns 129
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:48/OC:192/IH:96/IW:96/KH:3/KW:3/PT:1/PL:1/PB:1/PR:1/S:1/D:1/real_time 22050813 ns 22047140 ns 31
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:64/OC:256/IH:48/IW:48/KH:3/KW:3/PT:1/PL:1/PB:1/PR:1/S:1/D:1/real_time 9856375 ns 9855928 ns 71
```
Running the same benchmark on the same system without this PR gives:
```
Running ./build/Linux/Release/onnxruntime_mlas_benchmark
Run on (64 X 2000 MHz CPU s)
CPU Caches:
L1 Data 64 KiB (x64)
L1 Instruction 64 KiB (x64)
L2 Unified 2048 KiB (x64)
L3 Unified 36864 KiB (x1)
Load Average: 24.34, 13.45, 5.35
--------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:32/OC:32/IH:192/IW:192/KH:3/KW:3/PT:1/PL:1/PB:1/PR:1/S:1/D:1/real_time 17576371 ns 17576571 ns 40
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:32/OC:96/IH:192/IW:192/KH:3/KW:3/PT:0/PL:0/PB:1/PR:1/S:2/D:1/real_time 14319665 ns 14316035 ns 49
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:48/OC:192/IH:96/IW:96/KH:3/KW:3/PT:0/PL:0/PB:1/PR:1/S:2/D:1/real_time 11710396 ns 11707113 ns 60
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:48/OC:192/IH:96/IW:96/KH:3/KW:3/PT:1/PL:1/PB:1/PR:1/S:1/D:1/real_time 47237327 ns 47217289 ns 15
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:64/OC:256/IH:48/IW:48/KH:3/KW:3/PT:1/PL:1/PB:1/PR:1/S:1/D:1/real_time 21110794 ns 21103617 ns 33
```
which shows the following speedup:
| Benchmark row | With this PR (ns) | Without this PR (ns) | Speedup |
|---|---:|---:|---:|
| IC32 OC32 IH192 IW192 KH3 KW3 PT1 PL1 PB1 PR1 S1 D1 | 9,169,970 | 17,576,371 | 1.92x |
| IC32 OC96 IH192 IW192 KH3 KW3 PT0 PL0 PB1 PR1 S2 D1 | 7,048,829 | 14,319,665 | 2.03x |
| IC48 OC192 IH96 IW96 KH3 KW3 PT0 PL0 PB1 PR1 S2 D1 | 5,422,228 | 11,710,396 | 2.16x |
| IC48 OC192 IH96 IW96 KH3 KW3 PT1 PL1 PB1 PR1 S1 D1 | 22,050,813 | 47,237,327 | 2.14x |
| IC64 OC256 IH48 IW48 KH3 KW3 PT1 PL1 PB1 PR1 S1 D1 | 9,856,375 | 21,110,794 | 2.14x |
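Each speedup value is the time measured without this PR divided by the time measured with it, rounded to two decimals. A minimal sketch of that arithmetic follows. `Speedup` and `RoundTo2` are hypothetical helper names, not part of the benchmark.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: ratio of baseline time to optimized time.
double Speedup(double baseline_ns, double optimized_ns) {
    assert(baseline_ns > 0.0 && optimized_ns > 0.0);
    return baseline_ns / optimized_ns;
}

// Round to two decimal places, as in the speedup column.
double RoundTo2(double value) {
    return std::round(value * 100.0) / 100.0;
}
```

For the first row, `RoundTo2(Speedup(17576371.0, 9169970.0))` evaluates to 1.92.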
---------
Signed-off-by: Milos Puzovic <milos.puzovic@arm.com>