mlas/arm64: add NEON conv asm kernels and tune NCHWC kernel selection (#27099)
## Overview
This PR adds ARM64 NEON assembly micro‑kernels for NCHW, depthwise, and
pointwise convolution, wires them into the MLAS build, and adds
shape‑based selection heuristics so that the NCHWC depthwise and
pointwise paths use the asm kernels only in safe cases (stride‑1
pointwise with at least 4 outputs per iteration; depthwise with an
output width of at least 4). The BF16 path is unchanged.
## Key changes
- cmake/onnxruntime_mlas.cmake
  - Add new AArch64 assembly sources for NCHW, depthwise, and pointwise
    conv to the MLAS build.
- onnxruntime/core/mlas/lib/aarch64/SconvKernelNeon.S
  - New vectorised NCHW convolution micro‑kernel.
- onnxruntime/core/mlas/lib/aarch64/SconvDepthwiseKernelNeon.S
  - New vectorised depthwise micro‑kernel (fast path for in‑bounds loads,
    slow path for padding).
- onnxruntime/core/mlas/lib/aarch64/SconvPointwiseKernelNeon.S
  - New vectorised pointwise micro‑kernel (multi‑output reuse).
- onnxruntime/core/mlas/lib/mlasi.h,
  onnxruntime/core/mlas/lib/platform.cpp
  - Declare/register new asm kernels and prefer them on ARM64.
- onnxruntime/core/mlas/lib/snchwc.cpp
  - Heuristics: use pointwise asm when
    `StrideHeight == 1 && StrideWidth == 1` and
    `OutputThisIteration >= 4`; use depthwise asm when
    `OutputWidth >= 4`.
- onnxruntime/core/mlas/lib/sbconv_kernel_neon.cpp
  - Include fix for the conv kernel flags header.
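
The selection heuristics listed for snchwc.cpp can be sketched roughly as
below. This is an illustrative sketch only, not the code in the PR: the
enum `KernelChoice` and the functions `SelectPointwiseKernel` and
`SelectDepthwiseKernel` are hypothetical names, and only the conditions
on `StrideHeight`, `StrideWidth`, `OutputThisIteration`, and
`OutputWidth` come from the description above.

```cpp
#include <cstddef>

// Hypothetical type for illustration; not an MLAS declaration.
enum class KernelChoice { Asm, Fallback };

// Pointwise: take the asm kernel only for unit stride in both dimensions
// and when at least 4 outputs are produced in this iteration.
inline KernelChoice SelectPointwiseKernel(std::size_t StrideHeight,
                                          std::size_t StrideWidth,
                                          std::size_t OutputThisIteration) {
    const bool unit_stride = (StrideHeight == 1) && (StrideWidth == 1);
    return (unit_stride && OutputThisIteration >= 4) ? KernelChoice::Asm
                                                     : KernelChoice::Fallback;
}

// Depthwise: take the asm kernel only when the output row is at least
// 4 elements wide.
inline KernelChoice SelectDepthwiseKernel(std::size_t OutputWidth) {
    return (OutputWidth >= 4) ? KernelChoice::Asm : KernelChoice::Fallback;
}
```

Under these assumptions, a stride‑2 pointwise convolution or a depthwise
convolution with an output width of 3 would stay on the existing
(non‑asm) kernel.
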
## Performance
Numbers below are expressed as multipliers vs the non‑NCHWC baseline
(same model and perf_test settings):
Baseline (no `--enable_arm_neon_nchwc`)
- 8 cores: 1.00×
- 16 cores: 1.00×
With `--enable_arm_neon_nchwc` (no asm additions/heuristics)
- 8 cores: 1.18×
- 16 cores: 1.24×
With this PR (asm kernels + heuristics)
- 8 cores: 1.77×
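- 16 cores: 2.54×

Relative to the NCHWC build without the asm kernels, this is roughly
1.50× at 8 cores and 2.05× at 16 cores.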
## Testing
- `./build.sh --config Release --build_shared_lib --parallel
--compile_no_warning_as_error --skip_submodule_sync --skip_tests
--enable_pybind --build_wheel --enable_arm_neon_nchwc`
- `OMP_NUM_THREADS=8 ./build/Linux/Release/onnxruntime_perf_test -I -m
times -r 1000 --x 8 ~/mobilenetv2-7.onnx`
---------
Signed-off-by: Milos Puzovic <milos.puzovic@arm.com>
Co-authored-by: xhcao <xinghua.cao@intel.com>
Co-authored-by: quic-calvnguy <quic_calvnguy@quicinc.com>
Co-authored-by: quic_calvnguy <quic_calvnguy@quic_inc.com>
Co-authored-by: Jiawei Shao <jiawei.shao@intel.com>