onnxruntime
bd8f781f - mlas/arm64: add NEON conv asm kernels and tune NCHWC kernel selection (#27099)

Commit
72 days ago
## Overview

This PR adds ARM64 NEON assembly micro‑kernels for NCHW, depthwise, and pointwise convolution, wires them into the MLAS build, and adds shape‑based selection heuristics for NCHWC depthwise/pointwise to favor the asm kernels in safe cases (stride‑1 pointwise; wider depthwise outputs). The BF16 path is unchanged.

## Key changes

- `cmake/onnxruntime_mlas.cmake`
  - Add new AArch64 assembly sources for NCHW, depthwise, and pointwise conv to the MLAS build.
- `onnxruntime/core/mlas/lib/aarch64/SconvKernelNeon.S`
  - New vectorised NCHW convolution micro‑kernel.
- `onnxruntime/core/mlas/lib/aarch64/SconvDepthwiseKernelNeon.S`
  - New vectorised depthwise micro‑kernel (fast path for in‑bounds loads, slow path for padding).
- `onnxruntime/core/mlas/lib/aarch64/SconvPointwiseKernelNeon.S`
  - New vectorised pointwise micro‑kernel (multi‑output reuse).
- `onnxruntime/core/mlas/lib/mlasi.h`, `onnxruntime/core/mlas/lib/platform.cpp`
  - Declare/register new asm kernels and prefer them on ARM64.
- `onnxruntime/core/mlas/lib/snchwc.cpp`
  - Heuristics: use pointwise asm when `StrideHeight == 1 && StrideWidth == 1` and `OutputThisIteration >= 4`; use depthwise asm when `OutputWidth >= 4`.
- `onnxruntime/core/mlas/lib/sbconv_kernel_neon.cpp`
  - Include fix for the conv kernel flags header.

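The shape-based selection described for `snchwc.cpp` can be pictured as two small predicates. This is a minimal sketch, not the code from the PR: the struct `ConvShape`, the constant `kMinOutputsForAsm`, and the functions `UsePointwiseAsm` and `UseDepthwiseAsm` are hypothetical names chosen for illustration. Only the field names `StrideHeight`, `StrideWidth`, `OutputThisIteration`, and `OutputWidth` and the threshold of 4 come from the description above.

```cpp
#include <cstddef>

// Hypothetical container for the shape fields named in the PR description.
struct ConvShape {
    std::size_t StrideHeight;
    std::size_t StrideWidth;
    std::size_t OutputWidth;
};

// Threshold quoted in the PR description; the constant name is an assumption.
constexpr std::size_t kMinOutputsForAsm = 4;

// Pointwise: pick the asm kernel only for stride-1 convolutions that
// produce at least four outputs in the current iteration.
constexpr bool UsePointwiseAsm(const ConvShape& shape, std::size_t OutputThisIteration) {
    return shape.StrideHeight == 1 && shape.StrideWidth == 1 &&
           OutputThisIteration >= kMinOutputsForAsm;
}

// Depthwise: pick the asm kernel only when the output row is wide enough.
constexpr bool UseDepthwiseAsm(const ConvShape& shape) {
    return shape.OutputWidth >= kMinOutputsForAsm;
}
```

In this sketch, any shape that fails a predicate would fall through to whichever kernel was used before, which matches the "safe cases" wording in the overview.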
## Performance

Numbers below are expressed as multipliers vs the non‑NCHWC baseline (same model and perf_test settings):

Baseline (no `--enable_arm_neon_nchwc`)
- 8 cores: 1.00×
- 16 cores: 1.00×

With `--enable_arm_neon_nchwc` (no asm additions/heuristics)
- 8 cores: 1.18×
- 16 cores: 1.24×

With this PR (asm kernels + heuristics)
- 8 cores: 1.77×
- 16 cores: 2.54×

## Testing

- `./build.sh --config Release --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync --skip_tests --enable_pybind --build_wheel --enable_arm_neon_nchwc`
- `OMP_NUM_THREADS=8 ./build/Linux/Release/onnxruntime_perf_test -I -m times -r 1000 --x 8 ~/mobilenetv2-7.onnx`

---------

Signed-off-by: Milos Puzovic <milos.puzovic@arm.com>
Co-authored-by: xhcao <xinghua.cao@intel.com>
Co-authored-by: quic-calvnguy <quic_calvnguy@quicinc.com>
Co-authored-by: quic_calvnguy <quic_calvnguy@quic_inc.com>
Co-authored-by: Jiawei Shao <jiawei.shao@intel.com>