onnxruntime
b49fc62e - [MLAS] DequantizeLinear int8/uint8 (#24818)

### Description
- Adds multithreaded, vectorized implementations of DequantizeLinear for int8 and uint8 inputs:
  - Intel SSE2
  - ARM NEON
- All other architectures fall back to a multithreaded scalar reference implementation (the previous implementation was not multithreaded).
- **Note**: only enabled if ORT is built for client/on-device workloads (`ORT_CLIENT_PACKAGE_BUILD` is defined).

INT8 DequantizeLinear latency on Intel Core i9-10920X with 4 intra-op threads (SSE2 implementation)

| Number of elements | Baseline latency (us) | Multithreaded+SIMD latency (us) | Speedup |
| ------------------ | --------------------- | ------------------------------- | ------- |
| 10 K | 1 | 1 | 1 |
| 20 K | 2 | 2 | 1 |
| 40 K | 5 | 5 | 1 |
| 80 K | 11 | 4 | 2.75 |
| 100 K | 14 | 5 | 2.80 |
| 150 K | 21 | 7 | 3.00 |
| 200 K | 28 | 8 | 3.50 |
| 400 K | 68 | 15 | 4.53 |
| 600 K | 107 | 21 | 5.10 |
| 800 K | 142 | 28 | 5.07 |
| 1 M | 187 | 42 | 4.45 |
| 2 M | 376 | 102 | 3.69 |
| 4 M | 880 | 236 | 3.73 |
| 6 M | 1547 | 557 | 2.78 |
| 8 M | 2438 | 1097 | 2.22 |
| 10 M | 3192 | 1464 | 2.18 |
| 100 M | 38718 | 17733 | 2.18 |

INT8 DequantizeLinear latency on Snapdragon 8cx Gen 3 @ 3.4 GHz with 4 intra-op threads (NEON implementation)

| Number of elements | Baseline latency (us) | Multithreaded+SIMD latency (us) | Speedup |
| ------------------ | --------------------- | ------------------------------- | ------- |
| 10 K | 1 | 1 | 1 |
| 20 K | 1 | 1 | 1 |
| 40 K | 3 | 3 | 1 |
| 80 K | 7 | 4 | 1.75 |
| 100 K | 9 | 3 | 3.00 |
| 150 K | 14 | 5 | 2.80 |
| 200 K | 18 | 6 | 3.00 |
| 400 K | 38 | 10 | 3.80 |
| 600 K | 61 | 15 | 4.07 |
| 800 K | 76 | 19 | 4.00 |
| 1 M | 98 | 24 | 4.08 |
| 2 M | 204 | 48 | 4.25 |
| 4 M | 424 | 112 | 3.79 |
| 6 M | 677 | 384 | 1.76 |
| 8 M | 919 | 621 | 1.48 |
| 10 M | 1132 | 776 | 1.46 |
| 100 M | 11842 | 10566 | 1.12 |

### Motivation and Context
Improves latency of quantized QDQ models with large DQs that dominate the inference latency.