[MLAS/NEON] Add dedicated kernel for depthwise convolution for ARM64 using NEON intrinsics (#26688)
### Description
**Motivation and approach taken:**
Add a dedicated depthwise convolution kernel for the most common
depthwise convolution configuration (3x3 filter, stride = 1, pad <= 1,
dilation = 1) using NEON intrinsics. This performs significantly better
than the current `Im2Col + SGemm` approach: the Im2Col step wastefully
materializes the convolution patches, and for a 3x3 filter the resulting
SGemm has `K = 9`, a size that GEMM implementations are typically not
optimized for. A dedicated kernel therefore works much better.
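For illustration, here is a scalar sketch of the computation the new kernel vectorizes: a single-channel depthwise 3x3, stride-1, pad-1 convolution. The function name, signature, and layout are illustrative only, not the actual MLAS kernel or its API.

```cpp
#include <cstddef>
#include <vector>

// Scalar sketch of a depthwise 3x3, stride-1, pad-1 convolution over a
// single H x W channel plane. The NEON kernel in this PR vectorizes this
// loop nest; names and layout here are illustrative, not the MLAS API.
void DepthwiseConv3x3(const float* input, const float* filter,
                      float* output, size_t H, size_t W) {
  for (size_t oy = 0; oy < H; ++oy) {
    for (size_t ox = 0; ox < W; ++ox) {
      float acc = 0.0f;
      // 3*3 = 9 multiply-adds per output element -- the "small K" that
      // makes an Im2Col + SGemm formulation inefficient here.
      for (int ky = -1; ky <= 1; ++ky) {
        for (int kx = -1; kx <= 1; ++kx) {
          const long iy = static_cast<long>(oy) + ky;  // pad = 1
          const long ix = static_cast<long>(ox) + kx;
          if (iy >= 0 && iy < static_cast<long>(H) &&
              ix >= 0 && ix < static_cast<long>(W)) {
            acc += input[iy * W + ix] * filter[(ky + 1) * 3 + (kx + 1)];
          }
        }
      }
      output[oy * W + ox] = acc;
    }
  }
}
```

Because each channel is convolved independently, the per-channel loop nest above can be mapped directly onto NEON lanes without the patch-extraction pass that Im2Col requires.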
Initially, I ported over the Winograd-based NEON-accelerated depthwise
convolution kernel from PyTorch, but I found that its performance was
not very good. Its poor performance is probably due to applying the
Winograd transformation to the filter repeatedly. A better approach may
be to transform the filter offline; that can be considered later (I
reverted the PyTorch Winograd implementation in this commit:
https://github.com/microsoft/onnxruntime/pull/26688/commits/2820a84261123499e6ddb03e734810d8f6ad98ed).
The depthwise kernel added in this PR was authored by GPT5.1-Codex;
with some minor bug fixes it is now functionally correct and provides
the perf boost we are seeking.
**Unit tests:**
Depthwise convolution tests already exist in the codebase, so I don't
see a need for new ones at this point.
**Kernel benchmarking:**
This is the kernel-level perf improvement from the MLAS Conv benchmarks
(about a 50% reduction in kernel latency):
<img width="1055" height="90" alt="image"
src="https://github.com/user-attachments/assets/ead9eb83-2d62-4157-a065-70c67c8c7517"
/>
### Motivation and Context
A key customer model contains a few depthwise convolution operations,
and this change provides a **non-negligible ~3% throughput improvement**
in the customer-provided benchmarking setup.
For those interested,
https://github.com/microsoft/onnxruntime/pull/26654 adds support for the
same convolution variant but leverages SME1/SME2 through KleidiAI. This
PR is conceptually the same but targets NEON-only platforms.
---------
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>