[MLAS] Add depthwise with multiplier conv special kernel for NCHW data layout on Avx512 (#27874)
### Description
Adds a special AVX512 kernel for depthwise conv with channel multiplier = 2.
This improves the performance of 3 costly conv operations (7x7 kernels)
in the MobileClip model by approximately 2.4x (MLAS benchmark numbers
below).
The 3 ops are:
1. Cin=64, Cout=128, group=64, H=64, W=64, kH=7, kW=7
2. Cin=128, Cout=256, group=128, H=32, W=32, kH=7, kW=7
3. Cin=256, Cout=512, group=256, H=16, W=16, kH=7, kW=7
These Conv operations cannot be dispatched to NCHWc because the Cout per
group is smaller than the block size: on AVX512 the block size is 16,
while the Cout per group here is only 2. The NCHWc suite does include a
special depthwise kernel, but it can only handle Cout per group = 1.
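For reference, the operation the new kernel targets can be sketched as a scalar depthwise conv with a channel multiplier in NCHW layout. This is an illustrative reference implementation (function name, valid padding, and stride 1 are assumptions for the sketch), not the MLAS AVX512 kernel itself:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference for depthwise conv with channel multiplier M in NCHW layout.
// Each input channel c feeds M output channels (oc = c*M + m), each with its
// own kH x kW filter. Illustrative only; assumes valid padding and stride 1.
void DepthwiseConvNchwRef(const float* input, const float* filter, float* output,
                          size_t C, size_t H, size_t W,
                          size_t kH, size_t kW, size_t M) {
    const size_t oH = H - kH + 1;
    const size_t oW = W - kW + 1;
    for (size_t c = 0; c < C; ++c) {
        for (size_t m = 0; m < M; ++m) {
            const size_t oc = c * M + m;
            const float* f = filter + oc * kH * kW;  // one filter per output channel
            for (size_t oh = 0; oh < oH; ++oh) {
                for (size_t ow = 0; ow < oW; ++ow) {
                    float acc = 0.f;
                    for (size_t kh = 0; kh < kH; ++kh)
                        for (size_t kw = 0; kw < kW; ++kw)
                            acc += input[c * H * W + (oh + kh) * W + (ow + kw)] *
                                   f[kh * kW + kw];
                    output[oc * oH * oW + oh * oW + ow] = acc;
                }
            }
        }
    }
}
```

With M = 2, each of the `group` input channels produces two output channels, which is exactly the Cout-per-group = 2 shape that falls below the NCHWc block size.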
MLAS Benchmark Before and After comparison:
| Benchmark | BEFORE mean (ns) | AFTER mean (ns) | Speedup |
|---|---:|---:|---:|
| SCONV_NCHW G64 | 3,151,190 | 1,391,419 | 2.26x |
| SCONV_NCHW G128 | 1,646,040 | 824,654 | 2.00x |
| SCONV_NCHW G256 | 978,843 | 533,375 | 1.84x |
| SCONV_NCHW_THREADED G64 | 873,283 | 367,722 | 2.37x |
| SCONV_NCHW_THREADED G128 | 445,786 | 226,777 | 1.97x |
| SCONV_NCHW_THREADED G256 | 264,473 | 147,997 | 1.79x |
### Motivation and Context
Optimizing just these 3 conv operations makes MobileClip about
700us-850us faster, bringing the entire model under 14ms on an AVX512
machine.
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>