[MLAS] DequantizeLinear int8/uint8 (#24818)
### Description
- Adds multithreaded, vectorized implementations of DequantizeLinear for
int8 and uint8 inputs:
  - Intel SSE2
  - ARM NEON
- All other architectures fall back to a multithreaded scalar reference
implementation (the previous implementation was not multithreaded).
- **Note**: only enabled if ORT is built for client/on-device workloads
(`ORT_CLIENT_PACKAGE_BUILD` is defined).
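
For orientation, here is a minimal sketch of the computation a scalar fallback performs and of how the elements might be split into per-thread ranges. It is not the MLAS code: `DequantizeRange` and `DequantizeLinearSketch` are hypothetical names, a single per-tensor `scale` and `zero_point` are assumed, and the ranges are processed sequentially here, where a real implementation would hand each one to a thread pool.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not the MLAS API): dequantizes elements [begin, end).
// Computes output[i] = (input[i] - zero_point) * scale.
template <typename T>
void DequantizeRange(const T* input, float* output, std::size_t begin,
                     std::size_t end, float scale, T zero_point) {
  for (std::size_t i = begin; i < end; ++i) {
    // Widen to int32 first so the int8/uint8 subtraction cannot overflow.
    const std::int32_t centered = static_cast<std::int32_t>(input[i]) -
                                  static_cast<std::int32_t>(zero_point);
    output[i] = static_cast<float>(centered) * scale;
  }
}

// Hypothetical driver: splits n elements into at most num_threads contiguous
// ranges. Assumption: each range would be one task on a thread pool; this
// sketch runs them one after another to stay self-contained.
template <typename T>
void DequantizeLinearSketch(const T* input, float* output, std::size_t n,
                            float scale, T zero_point,
                            std::size_t num_threads) {
  assert(num_threads > 0);
  // Ceiling division so every element is covered by exactly one range.
  const std::size_t chunk = (n + num_threads - 1) / num_threads;
  for (std::size_t t = 0; t < num_threads; ++t) {
    const std::size_t begin = std::min(n, t * chunk);
    const std::size_t end = std::min(n, begin + chunk);
    if (begin == end) break;  // No work left for the remaining ranges.
    DequantizeRange(input, output, begin, end, scale, zero_point);
  }
}
```

Because the ranges are disjoint and each output element depends only on its own input element, the ranges can run concurrently without synchronization. A vectorized kernel would replace the inner loop of `DequantizeRange` with SIMD loads, widening, and multiplies.
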

INT8 DequantizeLinear latency on an Intel Core i9-10920X with 4 intra-op
threads (SSE2 implementation):

| Number of elements | Baseline latency (µs) | Multithreaded+SIMD latency (µs) | Speedup |
| ------------------ | --------------------- | ------------------------------- | ------- |
| 10 K | 1 | 1 | 1 |
| 20 K | 2 | 2 | 1 |
| 40 K | 5 | 5 | 1 |
| 80 K | 11 | 4 | 2.75 |
| 100 K | 14 | 5 | 2.80 |
| 150 K | 21 | 7 | 3.00 |
| 200 K | 28 | 8 | 3.50 |
| 400 K | 68 | 15 | 4.53 |
| 600 K | 107 | 21 | 5.10 |
| 800 K | 142 | 28 | 5.07 |
| 1 M | 187 | 42 | 4.45 |
| 2 M | 376 | 102 | 3.69 |
| 4 M | 880 | 236 | 3.73 |
| 6 M | 1547 | 557 | 2.78 |
| 8 M | 2438 | 1097 | 2.22 |
| 10 M | 3192 | 1464 | 2.18 |
| 100 M | 38718 | 17733 | 2.18 |

INT8 DequantizeLinear latency on a Snapdragon 8cx Gen 3 @ 3.4 GHz with 4
intra-op threads (NEON implementation):

| Number of elements | Baseline latency (µs) | Multithreaded+SIMD latency (µs) | Speedup |
| ------------------ | --------------------- | ------------------------------- | ------- |
| 10 K | 1 | 1 | 1 |
| 20 K | 1 | 1 | 1 |
| 40 K | 3 | 3 | 1 |
| 80 K | 7 | 4 | 1.75 |
| 100 K | 9 | 3 | 3.00 |
| 150 K | 14 | 5 | 2.80 |
| 200 K | 18 | 6 | 3.00 |
| 400 K | 38 | 10 | 3.80 |
| 600 K | 61 | 15 | 4.07 |
| 800 K | 76 | 19 | 4.00 |
| 1 M | 98 | 24 | 4.08 |
| 2 M | 204 | 48 | 4.25 |
| 4 M | 424 | 112 | 3.79 |
| 6 M | 677 | 384 | 1.76 |
| 8 M | 919 | 621 | 1.48 |
| 10 M | 1132 | 776 | 1.46 |
| 100 M | 11842 | 10566 | 1.12 |
### Motivation and Context
Improves the latency of quantized (QDQ) models in which large
DequantizeLinear (DQ) nodes dominate inference latency.