onnxruntime
2d2a3e57 - NEON kernels for NCHWc Convolution and Pooling (#25580)

Commit
93 days ago
NEON kernels for NCHWc Convolution and Pooling (#25580)

### Description

This PR implements optimized Arm NEON kernels for NCHWc (NCHW with the channel dimension split into fixed-size blocks) convolution and pooling operations in MLAS, significantly improving performance on Arm64 platforms.

### Motivation and Context

Fixes #24790

The new NCHWc kernels improve performance by 5-6x, depending on the thread count, the model, and similar factors. For example, in the MobileNet inference runs below, "Number of inferences per second" rises from 93 inf/s to 498 inf/s.

<details>
<summary>System configuration</summary>

```
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: ARM
Model name: Neoverse-V2
Model: 1
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 1
Stepping: r0p1
BogoMIPS: 2000.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
Caches (sum of all):
L1d: 4 MiB (64 instances)
L1i: 4 MiB (64 instances)
L2: 128 MiB (64 instances)
L3: 36 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-63
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Reg file data sampling: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; __user pointer sanitization
Spectre v2: Not affected
Srbds: Not affected
Tsx async abort: Not affected
```

</details>

<details>
<summary>Perf with current upstream kernels</summary>

```
./build/Linux/Release/onnxruntime_perf_test -x 32 -I -m times -r 1000 ~/scripts/mobilenet.onnx
Setting intra_op_num_threads to 32
Session creation time cost: 0.0238608 s
First inference time cost: 11 ms
Total inference time cost: 10.7458 s
Total inference requests: 1000
Average inference time cost: 10.7458 ms
Total inference run time: 10.7465 s
Number of inferences per second: 93.0534
Avg CPU usage: 50 %
Peak working set size: 70410240 bytes
Avg CPU usage:50
Peak working set size:70410240
Runs:1000
Min Latency: 0.0106707 s
Max Latency: 0.0113617 s
P50 Latency: 0.0107453 s
P90 Latency: 0.0107695 s
P95 Latency: 0.0107785 s
P99 Latency: 0.0107965 s
P999 Latency: 0.0113617 s
```

</details>

<details>
<summary>Perf with NCHWc kernels</summary>

```
./build/Linux/Release/onnxruntime_perf_test -x 32 -I -m times -r 1000 ~/scripts/mobilenet.onnx
Setting intra_op_num_threads to 32
Session creation time cost: 0.0358121 s
First inference time cost: 2 ms
Total inference time cost: 2.00561 s
Total inference requests: 1000
Average inference time cost: 2.00561 ms
Total inference run time: 2.00607 s
Number of inferences per second: 498.488
Avg CPU usage: 50 %
Peak working set size: 92467200 bytes
Avg CPU usage:50
Peak working set size:92467200
Runs:1000
Min Latency: 0.00198387 s
Max Latency: 0.00204784 s
P50 Latency: 0.00200537 s
P90 Latency: 0.0020155 s
P95 Latency: 0.00201822 s
P99 Latency: 0.0020251 s
P999 Latency: 0.00204784 s
```

</details>

Happy to run further performance tests as required.
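
For readers unfamiliar with the layout, the following is a minimal sketch of what "channel blocking" means: a flat NCHW buffer is reordered so that a small block of consecutive channels becomes the innermost dimension, which lets a SIMD kernel load one block per pixel with a single contiguous read. The function name `nchw_to_nchwc` and the block size of 4 are illustrative assumptions only; they are not the MLAS API and not necessarily the block size the NEON kernels use.

```python
def nchw_to_nchwc(data, n, c, h, w, block=4):
    """Reorder a flat NCHW list into a flat [N][C/block][H][W][block] list.

    `block` is a hypothetical block size chosen for illustration.
    Channels are zero-padded up to a multiple of `block`.
    """
    c_blocks = (c + block - 1) // block  # number of channel blocks, rounded up
    out = [0.0] * (n * c_blocks * h * w * block)
    for ni in range(n):
        for ci in range(c):
            cb, lane = divmod(ci, block)  # block index and position inside the block
            for hi in range(h):
                for wi in range(w):
                    src = ((ni * c + ci) * h + hi) * w + wi
                    dst = (((ni * c_blocks + cb) * h + hi) * w + wi) * block + lane
                    out[dst] = data[src]
    return out


if __name__ == "__main__":
    # 1 image, 6 channels, 1x2 pixels: channel ci holds the values [2*ci, 2*ci + 1].
    flat = [float(v) for v in range(12)]
    print(nchw_to_nchwc(flat, n=1, c=6, h=1, w=2))
```

In the reordered buffer, the first four values are channels 0 to 3 of the first pixel, and the last channel block is padded with zeros because 6 is not a multiple of 4.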