onnxruntime
c7c4243c - [MLAS] Simplify & optimize Arm64 NCHWc Convolution kernels (#26691)

Commit
31 days ago
[MLAS] Simplify & optimize Arm64 NCHWc Convolution kernels (#26691)

### Description

This PR makes the following changes:

1. Reroutes Pointwise Convolution to use GEMM, as a pointwise (1x1) convolution is essentially the same operation. This also unlocks the performance benefits of GEMM, as shown in the Performance section below.
2. Eliminates some branches in the code by using built-in MLAS functions such as `MlasBlendFloat32x4`.
3. Expands the unit test coverage of the NCHWc Conv kernels to catch edge cases.

### Performance

This speeds up any Conv model that uses the pointwise kernel. For example, Mobilenet inference speeds up from 500 inf/sec to 550 inf/sec.

### Testing

- Build passed: `./build.sh --config=Release --build_shared_lib --parallel --cmake_extra_defines onnxruntime_USE_ARM_NEON_NCHWC=ON`
- Unit tests passed: `./build/Linux/Release/onnxruntime_mlas_test --gtest_filter=Conv2dNchwc_*`
- Perf: `./build/Linux/Release/onnxruntime_perf_test -x 32 -I -m times -r 2000 ~/scripts/mobilenet.onnx`

Happy to run additional perf tests as required.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
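
For readers unfamiliar with why a pointwise convolution can be routed through GEMM, here is a minimal sketch of the equivalence. It is not the MLAS implementation: it assumes a plain NCHW layout with batch size 1 (not the blocked NCHWc layout the kernels use), no bias or activation, and a naive GEMM. The function names `PointwiseConvDirect`, `Gemm`, and `PointwiseConvAsGemm` are hypothetical and exist only for illustration.

```cpp
#include <cstddef>
#include <vector>

// Reference 1x1 (pointwise) convolution, stride 1, no padding, batch size 1.
// Layouts (row-major): input [c_in][h*w], weights [c_out][c_in],
// output [c_out][h*w]. Hypothetical helper, not an MLAS function.
std::vector<float> PointwiseConvDirect(const std::vector<float>& input,
                                       const std::vector<float>& weights,
                                       std::size_t c_in, std::size_t c_out,
                                       std::size_t h, std::size_t w) {
    std::vector<float> output(c_out * h * w, 0.0f);
    for (std::size_t oc = 0; oc < c_out; ++oc) {
        for (std::size_t y = 0; y < h; ++y) {
            for (std::size_t x = 0; x < w; ++x) {
                float acc = 0.0f;
                // Each output pixel is a dot product over input channels.
                for (std::size_t ic = 0; ic < c_in; ++ic) {
                    acc += weights[oc * c_in + ic] * input[(ic * h + y) * w + x];
                }
                output[(oc * h + y) * w + x] = acc;
            }
        }
    }
    return output;
}

// Naive row-major GEMM: C[m x n] = A[m x k] * B[k x n].
std::vector<float> Gemm(const std::vector<float>& a,
                        const std::vector<float>& b,
                        std::size_t m, std::size_t k, std::size_t n) {
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t p = 0; p < k; ++p) {
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += a[i * k + p] * b[p * n + j];
            }
        }
    }
    return c;
}

// The same convolution expressed as one GEMM call:
// A = weights (c_out x c_in), B = input (c_in x h*w), C = output (c_out x h*w).
std::vector<float> PointwiseConvAsGemm(const std::vector<float>& input,
                                       const std::vector<float>& weights,
                                       std::size_t c_in, std::size_t c_out,
                                       std::size_t h, std::size_t w) {
    return Gemm(weights, input, c_out, c_in, h * w);
}
```

Because the kernel is 1x1, no im2col-style rearrangement of the input is needed in this layout: the input tensor can be read directly as the `c_in x (h*w)` matrix, which is what makes handing the work to an optimized GEMM attractive.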
Committer
sumikuma