onnxruntime
57b265ee - [MLAS] Add depthwise with multiplier conv special kernel for NCHW data layout on Avx512 (#27874)

### Description

Adds a special AVX512 kernel for depthwise conv with channel multiplier = 2. This improves the performance of three costly conv operations (7x7 kernels) in the MobileClip model by approximately 2.4x (MLAS benchmark numbers below). The three ops are:

1. Cin=64, Cout=128, group=64, H=64, W=64, kH=7, kW=7
2. Cin=128, Cout=256, group=128, H=32, W=32, kH=7, kW=7
3. Cin=256, Cout=512, group=256, H=16, W=16, kH=7, kW=7

These Conv operations cannot be dispatched to NCHWc because the Cout per group is smaller than the block size: on AVX512 the block size is 16, while the Cout per group is only 2. The NCHWc suite does include a special depthwise kernel, but it can only handle Cout per group = 1.

MLAS benchmark before/after comparison:

| Benchmark | BEFORE mean (ns) | AFTER mean (ns) | Speedup |
|---|---:|---:|---:|
| SCONV_NCHW G64 | 3,151,190 | 1,391,419 | 2.26x |
| SCONV_NCHW G128 | 1,646,040 | 824,654 | 2.00x |
| SCONV_NCHW G256 | 978,843 | 533,375 | 1.84x |
| SCONV_NCHW_THREADED G64 | 873,283 | 367,722 | 2.37x |
| SCONV_NCHW_THREADED G128 | 445,786 | 226,777 | 1.97x |
| SCONV_NCHW_THREADED G256 | 264,473 | 147,997 | 1.79x |

### Motivation and Context

Optimizing just these three conv operations makes MobileClip about 700us-850us faster, and the entire model runs in under 14ms on an AVX512 machine.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
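For reference, the operation this kernel accelerates — depthwise convolution with a channel multiplier in NCHW layout, where each input channel produces `multiplier` output channels — can be sketched in plain NumPy. This is a hypothetical reference implementation for illustration only, not the MLAS kernel; it assumes stride 1 and no padding, and the function name is invented:

```python
import numpy as np

def depthwise_conv_nchw(x, w, multiplier):
    """Naive depthwise conv with channel multiplier (stride 1, no padding).

    x: input  of shape (Cin, H, W)          -- NCHW layout, single batch
    w: weights of shape (Cin*multiplier, kH, kW)
    Returns output of shape (Cin*multiplier, H-kH+1, W-kW+1).
    """
    cin, h, wdt = x.shape
    cout, kh, kw = w.shape
    assert cout == cin * multiplier
    oh, ow = h - kh + 1, wdt - kw + 1
    y = np.zeros((cout, oh, ow), dtype=x.dtype)
    for co in range(cout):
        ci = co // multiplier  # each input channel feeds `multiplier` outputs
        for i in range(oh):
            for j in range(ow):
                y[co, i, j] = np.sum(x[ci, i:i + kh, j:j + kw] * w[co])
    return y
```

With multiplier = 2, Cout per group is 2 — below the AVX512 NCHWc block size of 16 — which is why the generic NCHWc path cannot be used and a dedicated kernel pays off.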