[MLAS/NEON] Add dedicated kernel for depthwise convolution for ARM64 using NEON intrinsics (#26688)
### Description
**Motivation and approach taken:**
Add a dedicated depthwise convolution kernel for the most common
depthwise convolution configuration (3x3 filter, stride = 1, pad <= 1,
dilation = 1) using NEON intrinsics. This performs significantly better
than the current `Im2Col + SGemm` approach: the Im2Col step wastefully
materializes the convolution patches, and for a 3x3 filter the resulting
SGemm has `K = 9`, a size that GEMM implementations are typically not
optimized for. A dedicated kernel therefore works much better.
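For illustration, here is a scalar sketch of the computation the new kernel vectorizes: a single-channel depthwise 3x3, stride-1, pad-1 convolution. The function name, signature, and layout are illustrative only, not the actual MLAS kernel or its API.

```cpp
#include <cstddef>
#include <vector>

// Scalar sketch of a depthwise 3x3, stride-1, pad-1 convolution over a
// single H x W channel plane. The NEON kernel in this PR vectorizes this
// loop nest; names and layout here are illustrative, not the MLAS API.
void DepthwiseConv3x3(const float* input, const float* filter,
                      float* output, size_t H, size_t W) {
  for (size_t oy = 0; oy < H; ++oy) {
    for (size_t ox = 0; ox < W; ++ox) {
      float acc = 0.0f;
      // 3*3 = 9 multiply-adds per output element -- the "small K" that
      // makes an Im2Col + SGemm formulation inefficient here.
      for (int ky = -1; ky <= 1; ++ky) {
        for (int kx = -1; kx <= 1; ++kx) {
          const long iy = static_cast<long>(oy) + ky;  // pad = 1
          const long ix = static_cast<long>(ox) + kx;
          if (iy >= 0 && iy < static_cast<long>(H) &&
              ix >= 0 && ix < static_cast<long>(W)) {
            acc += input[iy * W + ix] * filter[(ky + 1) * 3 + (kx + 1)];
          }
        }
      }
      output[oy * W + ox] = acc;
    }
  }
}
```

Because each channel is convolved independently, the per-channel loop nest above can be mapped directly onto NEON lanes without the patch-extraction pass that Im2Col requires.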
Initially, I ported over the Winograd-based NEON-accelerated depthwise
convolution kernel from PyTorch, but I found that its performance was
not very good. Its poor performance is probably due to applying the
Winograd transformation to the filter repeatedly. A better approach may
be to transform the filter offline; that can be considered later (I
reverted the PyTorch Winograd implementation in this commit:
https://github.com/microsoft/onnxruntime/pull/26688/commits/2820a84261123499e6ddb03e734810d8f6ad98ed).
The depthwise kernel added in this PR was authored by GPT5.1-Codex;
with some minor bug fixes it is now functionally correct and provides
the perf boost we are seeking.
**Unit tests:**
Depthwise convolution tests already exist in the codebase, so I don't
see a need for new ones at this point.
**Kernel benchmarking:**
This is the kernel-level perf improvement from the MLAS Conv benchmarks
(about a 50% reduction in kernel latency):
<img width="1055" height="90" alt="image"
src="https://github.com/user-attachments/assets/ead9eb83-2d62-4157-a065-70c67c8c7517"
/>
### Motivation and Context
A key customer model contains a few depthwise convolution operations,
and this change provides a **non-negligible ~3% throughput improvement**
in the customer-provided benchmarking setup.
For those interested,
https://github.com/microsoft/onnxruntime/pull/26654 adds support for the
same convolution variant but leverages SME1/SME2 through KleidiAI. This
PR is conceptually the same but targets NEON-only platforms.
---------
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>