onnxruntime
e10ea97c - mlas/arm64: Add AArch64 assembly path for NCHWc float kernel and wire into build (#27788)

Commit
67 days ago
mlas/arm64: Add AArch64 assembly path for NCHWc float kernel and wire into build (#27788)

### Description

This change introduces a new AArch64 assembly implementation for the NCHWc float convolution path and keeps the existing C++ entrypoint stable. Build wiring ensures that the same kernel entry is used transparently by existing call sites.

The optimization strategy is centered on three ideas:

1. Route the stable NCHWc C++ entrypoint to an AArch64 assembly implementation.
2. Split execution into left/middle/right regions, with a 4-output center tile and 3/2/1 remainder kernels in the middle region.
3. Duplicate hot loops for flags == 0 so the no-post-op path avoids repeated accumulate/bias/ReLU flag checks inside the hottest loops.

### Motivation and Context

This improves throughput while preserving existing behavior and integration points, and addresses the comment https://github.com/microsoft/onnxruntime/pull/27099#issuecomment-3792485400 from that PR (cc: @Rohanjames1997 and @hariharans29).

Per-convolution improvements, by core count, for the convolutions in the model from https://github.com/microsoft/onnxruntime/pull/25580#issuecomment-3321304864, measured on AWS Graviton 4 (all of these convolutions share N = 1, KH = 3, KW = 3, DH = 1, DW = 1 and G = 1):

<table>
<thead>
<tr>
<th>Cores</th>
<th>IC</th><th>OC</th><th>IH</th><th>IW</th><th>OH</th><th>OW</th>
<th>SH</th><th>SW</th><th>PT</th><th>PL</th><th>PB</th><th>PR</th>
<th>P90 Improvement</th><th>P99 Improvement</th>
</tr>
</thead>
<tbody>
<tr><td rowspan="5">1</td><td>32</td><td>32</td><td>192</td><td>192</td><td>192</td><td>192</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+47.57%</td><td>+43.64%</td></tr>
<tr><td>32</td><td>96</td><td>192</td><td>192</td><td>96</td><td>96</td><td>2</td><td>2</td><td>0</td><td>0</td><td>1</td><td>1</td><td>+47.56%</td><td>+44.69%</td></tr>
<tr><td>48</td><td>192</td><td>96</td><td>96</td><td>48</td><td>48</td><td>2</td><td>2</td><td>0</td><td>0</td><td>1</td><td>1</td><td>+47.25%</td><td>+45.12%</td></tr>
<tr><td>48</td><td>192</td><td>96</td><td>96</td><td>96</td><td>96</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+48.73%</td><td>+47.07%</td></tr>
<tr><td>64</td><td>256</td><td>48</td><td>48</td><td>48</td><td>48</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+47.73%</td><td>+47.76%</td></tr>
<tr><td rowspan="5">2</td><td>32</td><td>32</td><td>192</td><td>192</td><td>192</td><td>192</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+47.72%</td><td>+42.78%</td></tr>
<tr><td>32</td><td>96</td><td>192</td><td>192</td><td>96</td><td>96</td><td>2</td><td>2</td><td>0</td><td>0</td><td>1</td><td>1</td><td>+48.41%</td><td>+47.76%</td></tr>
<tr><td>48</td><td>192</td><td>96</td><td>96</td><td>48</td><td>48</td><td>2</td><td>2</td><td>0</td><td>0</td><td>1</td><td>1</td><td>+47.13%</td><td>+44.62%</td></tr>
<tr><td>48</td><td>192</td><td>96</td><td>96</td><td>96</td><td>96</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+48.58%</td><td>+46.65%</td></tr>
<tr><td>64</td><td>256</td><td>48</td><td>48</td><td>48</td><td>48</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+47.59%</td><td>+47.32%</td></tr>
<tr><td rowspan="5">4</td><td>32</td><td>32</td><td>192</td><td>192</td><td>192</td><td>192</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+45.18%</td><td>+38.51%</td></tr>
<tr><td>32</td><td>96</td><td>192</td><td>192</td><td>96</td><td>96</td><td>2</td><td>2</td><td>0</td><td>0</td><td>1</td><td>1</td><td>+46.34%</td><td>+45.25%</td></tr>
<tr><td>48</td><td>192</td><td>96</td><td>96</td><td>48</td><td>48</td><td>2</td><td>2</td><td>0</td><td>0</td><td>1</td><td>1</td><td>+46.17%</td><td>+42.45%</td></tr>
<tr><td>48</td><td>192</td><td>96</td><td>96</td><td>96</td><td>96</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+48.20%</td><td>+45.25%</td></tr>
<tr><td>64</td><td>256</td><td>48</td><td>48</td><td>48</td><td>48</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+47.32%</td><td>+46.67%</td></tr>
<tr><td rowspan="5">8</td><td>32</td><td>32</td><td>192</td><td>192</td><td>192</td><td>192</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+40.07%</td><td>+25.57%</td></tr>
<tr><td>32</td><td>96</td><td>192</td><td>192</td><td>96</td><td>96</td><td>2</td><td>2</td><td>0</td><td>0</td><td>1</td><td>1</td><td>+42.93%</td><td>+45.06%</td></tr>
<tr><td>48</td><td>192</td><td>96</td><td>96</td><td>48</td><td>48</td><td>2</td><td>2</td><td>0</td><td>0</td><td>1</td><td>1</td><td>+44.50%</td><td>+38.86%</td></tr>
<tr><td>48</td><td>192</td><td>96</td><td>96</td><td>96</td><td>96</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+47.31%</td><td>+45.22%</td></tr>
<tr><td>64</td><td>256</td><td>48</td><td>48</td><td>48</td><td>48</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+47.02%</td><td>+46.42%</td></tr>
<tr><td rowspan="5">16</td><td>32</td><td>32</td><td>192</td><td>192</td><td>192</td><td>192</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+34.06%</td><td>+28.48%</td></tr>
<tr><td>32</td><td>96</td><td>192</td><td>192</td><td>96</td><td>96</td><td>2</td><td>2</td><td>0</td><td>0</td><td>1</td><td>1</td><td>+38.12%</td><td>+39.83%</td></tr>
<tr><td>48</td><td>192</td><td>96</td><td>96</td><td>48</td><td>48</td><td>2</td><td>2</td><td>0</td><td>0</td><td>1</td><td>1</td><td>+42.17%</td><td>+35.48%</td></tr>
<tr><td>48</td><td>192</td><td>96</td><td>96</td><td>96</td><td>96</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+45.25%</td><td>+42.73%</td></tr>
<tr><td>64</td><td>256</td><td>48</td><td>48</td><td>48</td><td>48</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>+46.44%</td><td>+45.84%</td></tr>
</tbody>
</table>

End-to-end performance improvement when running the model at different core counts:

| Model | Cores | P90 Improvement | P99 Improvement |
|---|---:|---:|---:|
| shareable_model.onnx | 1 | +24.35% | +24.17% |
| shareable_model.onnx | 2 | +24.14% | +23.82% |
| shareable_model.onnx | 4 | +23.40% | +23.22% |
| shareable_model.onnx | 8 | +22.72% | +22.55% |
| shareable_model.onnx | 16 | +20.19% | +19.89% |

Running the `bench_sconv_nchwc.cpp` benchmark on an AWS Graviton 4 `c8g.16xlarge` instance with this PR gives:

```
$ ./build/Linux/Release/onnxruntime_mlas_benchmark --benchmark_filter='SCONV_NCHWC_DIRECT/DirectNchwcCases'
The number of inputs is very large. QNBITGEMM<float, 4> will be repeated at least 128 times.
The number of inputs is very large. QNBITGEMM<float, 8> will be repeated at least 128 times.
2026-03-23T17:35:58+00:00
Running ./build/Linux/Release/onnxruntime_mlas_benchmark
Run on (64 X 2000 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x64)
  L1 Instruction 64 KiB (x64)
  L2 Unified 2048 KiB (x64)
  L3 Unified 36864 KiB (x1)
Load Average: 0.01, 0.13, 0.36
--------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:32/OC:32/IH:192/IW:192/KH:3/KW:3/PT:1/PL:1/PB:1/PR:1/S:1/D:1/real_time 9169970 ns 9170100 ns 76
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:32/OC:96/IH:192/IW:192/KH:3/KW:3/PT:0/PL:0/PB:1/PR:1/S:2/D:1/real_time 7048829 ns 7047903 ns 99
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:48/OC:192/IH:96/IW:96/KH:3/KW:3/PT:0/PL:0/PB:1/PR:1/S:2/D:1/real_time 5422228 ns 5421912 ns 129
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:48/OC:192/IH:96/IW:96/KH:3/KW:3/PT:1/PL:1/PB:1/PR:1/S:1/D:1/real_time 22050813 ns 22047140 ns 31
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:64/OC:256/IH:48/IW:48/KH:3/KW:3/PT:1/PL:1/PB:1/PR:1/S:1/D:1/real_time 9856375 ns 9855928 ns 71
```

Running the same benchmark on the same system without this PR gives:

```
Running ./build/Linux/Release/onnxruntime_mlas_benchmark
Run on (64 X 2000 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x64)
  L1 Instruction 64 KiB (x64)
  L2 Unified 2048 KiB (x64)
  L3 Unified 36864 KiB (x1)
Load Average: 24.34, 13.45, 5.35
--------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:32/OC:32/IH:192/IW:192/KH:3/KW:3/PT:1/PL:1/PB:1/PR:1/S:1/D:1/real_time 17576371 ns 17576571 ns 40
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:32/OC:96/IH:192/IW:192/KH:3/KW:3/PT:0/PL:0/PB:1/PR:1/S:2/D:1/real_time 14319665 ns 14316035 ns 49
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:48/OC:192/IH:96/IW:96/KH:3/KW:3/PT:0/PL:0/PB:1/PR:1/S:2/D:1/real_time 11710396 ns 11707113 ns 60
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:48/OC:192/IH:96/IW:96/KH:3/KW:3/PT:1/PL:1/PB:1/PR:1/S:1/D:1/real_time 47237327 ns 47217289 ns 15
SCONV_NCHWC_DIRECT/DirectNchwcCases/IC:64/OC:256/IH:48/IW:48/KH:3/KW:3/PT:1/PL:1/PB:1/PR:1/S:1/D:1/real_time 21110794 ns 21103617 ns 33
```

which shows the following speedup:

| Benchmark row | With this PR (ns) | Without this PR (ns) | Speedup |
|---|---:|---:|---:|
| IC32 OC32 IH192 IW192 KH3 KW3 PT1 PL1 PB1 PR1 S1 D1 | 9,169,970 | 17,576,371 | 1.92x |
| IC32 OC96 IH192 IW192 KH3 KW3 PT0 PL0 PB1 PR1 S2 D1 | 7,048,829 | 14,319,665 | 2.03x |
| IC48 OC192 IH96 IW96 KH3 KW3 PT0 PL0 PB1 PR1 S2 D1 | 5,422,228 | 11,710,396 | 2.16x |
| IC48 OC192 IH96 IW96 KH3 KW3 PT1 PL1 PB1 PR1 S1 D1 | 22,050,813 | 47,237,327 | 2.14x |
| IC64 OC256 IH48 IW48 KH3 KW3 PT1 PL1 PB1 PR1 S1 D1 | 9,856,375 | 21,110,794 | 2.14x |

---------

Signed-off-by: Milos Puzovic <milos.puzovic@arm.com>
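
For readers unfamiliar with the tiling described under "Description", ideas 2 and 3 can be illustrated with a small C++ sketch. This is not the MLAS code: `PlanRow`, `ApplyPostOps`, the `Tile` struct, the flag constants, and the choice to process the left and right regions one output at a time are all hypothetical and chosen only to show the shape of the technique.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; this only illustrates the strategy.
enum class Region { Left, Middle, Right };

struct Tile {
    Region region;
    std::size_t first;  // index of the first output element in the row
    std::size_t count;  // number of outputs handled by this tile
};

// Split one output row into left / middle / right regions. The middle
// region uses 4-output tiles, then a single 3/2/1-output remainder tile.
// Assumption: edge regions are handled one output at a time.
std::vector<Tile> PlanRow(std::size_t left, std::size_t middle, std::size_t right) {
    std::vector<Tile> tiles;
    std::size_t pos = 0;
    for (std::size_t i = 0; i < left; ++i) tiles.push_back({Region::Left, pos++, 1});
    while (middle >= 4) {
        tiles.push_back({Region::Middle, pos, 4});
        pos += 4;
        middle -= 4;
    }
    if (middle > 0) {  // remainder kernel for 3, 2 or 1 outputs
        tiles.push_back({Region::Middle, pos, middle});
        pos += middle;
    }
    for (std::size_t i = 0; i < right; ++i) tiles.push_back({Region::Right, pos++, 1});
    return tiles;
}

// Hypothetical post-op flags (accumulate into output, add bias, ReLU).
constexpr unsigned kAccumulate = 1u, kBias = 2u, kRelu = 4u;

// The flags == 0 case gets its own loop, so the common no-post-op path
// performs no per-element flag checks.
void ApplyPostOps(float* out, const float* acc, const float* bias,
                  std::size_t n, unsigned flags) {
    if (flags == 0) {
        for (std::size_t i = 0; i < n; ++i) out[i] = acc[i];
        return;
    }
    for (std::size_t i = 0; i < n; ++i) {
        float v = acc[i];
        if (flags & kAccumulate) v += out[i];
        if (flags & kBias) v += bias[i];
        if (flags & kRelu) v = std::max(v, 0.0f);
        out[i] = v;
    }
}
```

With a hypothetical row of 1 left, 10 middle and 1 right outputs, `PlanRow(1, 10, 1)` yields two 4-output tiles and one 2-output remainder tile in the middle region, bracketed by single-output edge tiles.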