[QNN/CPU EP] Add 16-bit Quantize/Dequantize contrib ops (#17015)
### Description
- Adds 16-bit integer support to:
  - Quantization kernel implementations: Intel, Neon, and Power intrinsics
  - DequantizeLinear and QuantizeLinear contrib ops
  - QNN EP Quantize and Dequantize operators
  - Python quantization scripts
- Disables QDQ fusions for most 16-bit QDQ node groups (need to add 16-bit support to the QLinear* ops)
- Retains support for dropping QDQ nodes from Split, Gather, Reshape, Transpose, Squeeze, and Unsqueeze node groups.
Sample Python code to generate a QDQ model with 16-bit activations and 8-bit weights:
```python
from onnxruntime.quantization import QuantType, quantize_static

# Quantize activations to uint16 and weights to uint8.
quantize_static(
    input_model_path,
    output_model_path,
    data_reader,
    quant_format=args.quant_format,
    per_channel=args.per_channel,
    activation_type=QuantType.QUInt16,
    weight_type=QuantType.QUInt8,
    extra_options={"DedicatedQDQPair": True, "ForceQuantizeNoInputCheck": True, "UseQDQContribOps": True},
)
```
Note that enabling the `UseQDQContribOps` extra option is not strictly
necessary. If the 16-bit types are used without enabling
`UseQDQContribOps`, the domains of the QDQ ops are overridden to
'com.microsoft', and a warning is printed to stdout.
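As a rough illustration (not taken from this PR), the contrib-domain form of a 16-bit QDQ op can be built with the standard `onnx.helper` API; the node and initializer names below are hypothetical and only show the `com.microsoft` domain and the uint16 zero-point type:
```python
from onnx import TensorProto, helper

# Hypothetical sketch: a 16-bit QuantizeLinear node in the 'com.microsoft'
# contrib domain, i.e. the domain the quantizer falls back to (with a warning)
# when 16-bit types are used without enabling UseQDQContribOps.
quant_node = helper.make_node(
    "QuantizeLinear",
    inputs=["input", "scale", "zero_point"],
    outputs=["input_quantized"],
    domain="com.microsoft",
)

# The uint16 zero-point initializer is what makes this a 16-bit QDQ op.
zero_point = helper.make_tensor("zero_point", TensorProto.UINT16, dims=[], vals=[32768])
```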
### Automated Tests
MLAS/CPU EP:
- [x] 16-bit QuantizeLinear computation
- [x] 16-bit DequantizeLinear computation
Optimizer:
- [x] Transpose QDQ fusion
- [x] Gather QDQ fusion
- [x] Reshape QDQ fusion
- [x] Squeeze QDQ fusion
- [x] Unsqueeze QDQ fusion
- [x] Split drop QDQ
- [x] DoubleQDQPairRemover
- [x] Transpose optimization
- [x] EnsureUniqueDQForNodeUnit
- [x] Common subexpression elimination (DQ not removed)
- [x] Constant folding
QNN EP:
- [x] Conv 16-bit activations, 8-bit weights
- [x] MatMul 16-bit activations, 8-bit weights
- [x] Unary 16-bit QDQ ops
- [x] Binary 16-bit QDQ ops
Quantization tool:
- [x] Test creation of 16-bit QDQ model
### Motivation and Context
Support mixed-precision models with 8-bit weights and 16-bit activations.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>