Fix INT32 bias overflow in QOperator INT8 symmetric quantization by adjusting weight scale and requantizing (#25278)
### Overview
This PR introduces a critical fix for **QOperator INT8 symmetric
quantization** in ONNX Runtime. It addresses a situation where the
computed **bias scale** (`input_scale * weight_scale`) becomes too
small, leading to **int32 overflow** or **precision clipping** during
bias quantization.
### Problem
In symmetric quantization (i.e., zero_point = 0), the bias tensor is
quantized using a fixed-point scale:
**bias_scale = input_scale * weight_scale**
When this product is very small, the quantized bias values can far exceed
the `int32` range, causing saturation (and hence large quantization
error) once they are clamped back into range.
This was observed to cause **>51% accuracy loss** in some models.
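The failure mode can be reproduced in a few lines. This is a minimal illustrative sketch (plain NumPy, not ONNX Runtime code); the scale values are made-up examples chosen to trigger the overflow:

```python
import numpy as np

INT32_MIN, INT32_MAX = np.iinfo(np.int32).min, np.iinfo(np.int32).max

# Hypothetical scales: a tiny weight scale makes the bias scale tiny too.
input_scale = 1e-5
weight_scale = 1e-6
bias_scale = input_scale * weight_scale  # 1e-11

bias_float = np.array([0.5, -0.3], dtype=np.float32)

# Quantize: q = round(bias / scale), then clamp into the int32 range.
q_unclamped = np.round(bias_float / bias_scale)   # ~[5e10, -3e10]
q_bias = np.clip(q_unclamped, INT32_MIN, INT32_MAX).astype(np.int64)

# Dequantizing the saturated values gives a badly wrong bias.
bias_roundtrip = q_bias.astype(np.float64) * bias_scale
```

Here `q_unclamped` is on the order of 5e10, far beyond `INT32_MAX` (~2.1e9), so the clamped bias dequantizes to roughly 0.02 instead of 0.5.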
### Solution
This PR adds two new functions to mitigate this:
---
#### 🔧 `_adjust_weight_scale_for_int32_bias(...)`
Located in `onnx_quantizer.py`, this function:
- **Inspects the float bias range** to compute the smallest valid bias
scale (based on int32 dynamic range)
- **Compares** this threshold against `input_scale * weight_scale`
- If it is too small, **scales up the weight scale** just enough that the
resulting bias scale keeps the quantized bias within the int32 range
- Supports both per-tensor and per-channel weight quantization cases
This logic is **only triggered when**:
- The weight's zero point is exactly zero (i.e. symmetric)
- The weight data type is `INT8` or `INT16`
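The core of the adjustment can be sketched as follows. This is a simplified per-tensor illustration of the idea, not the actual `_adjust_weight_scale_for_int32_bias` implementation (the function name and signature here are illustrative):

```python
import numpy as np

def adjust_weight_scale_for_int32_bias(input_scale, weight_scale, bias_float):
    """Illustrative sketch: widen weight_scale so that
    input_scale * weight_scale can represent the bias in int32."""
    int32_max = np.iinfo(np.int32).max
    # Smallest bias scale that keeps every quantized bias inside int32.
    min_bias_scale = np.max(np.abs(bias_float)) / int32_max
    bias_scale = input_scale * weight_scale
    if bias_scale < min_bias_scale:
        # input_scale is fixed by the activations, so grow weight_scale.
        weight_scale = weight_scale * (min_bias_scale / bias_scale)
    return weight_scale

# Hypothetical values: the tiny weight scale gets widened.
adjusted = adjust_weight_scale_for_int32_bias(
    1e-5, 1e-6, np.array([0.5, -0.3], dtype=np.float32))
```

For per-channel quantization the same check would run independently for each output channel's weight scale, using that channel's bias value.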
---
#### 🔄 `_requantize_weight(...)`
After weight scale adjustment, this function:
- **Finds the original quantized weight** (`q_weight`), scale, and zero
point from the initializer list
- **Removes** the outdated quantized weight and scale
- **Re-quantizes** the original float weights using the new scale and
the same zero point
- **Re-inserts** them into the model to maintain consistency
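The re-quantization step itself is standard symmetric quantization with the new scale. A minimal sketch (illustrative names, not the actual `_requantize_weight` code, which also handles initializer bookkeeping in the model graph):

```python
import numpy as np

def requantize_weight(weight_float, new_scale, zero_point=0,
                      qmin=-127, qmax=127):
    """Illustrative sketch: re-quantize float weights with the adjusted
    scale, keeping the original (symmetric, zero) zero point."""
    q = np.round(weight_float / new_scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

# Example: weights re-quantized with an adjusted (larger) scale.
q_weight = requantize_weight(
    np.array([0.1, -0.2], dtype=np.float32), new_scale=0.01)
```

In the actual PR this result replaces the stale `q_weight` initializer in the model, so the graph stays consistent with the adjusted scale.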
---
### Summary of Benefits
- ✅ Prevents int32 overflow or saturation during symmetric bias
quantization
- ✅ Ensures weight and bias quantization remain consistent
- ✅ Reduced quantization error from >51.4% to ~3% in test models
- ✅ Fix is limited in scope to QOperator + symmetric INT8/INT16 flow
(safe for other modes)
- ✅ Improves robustness of static quantization for hardware that
performs integer-only inference
---
### Code Location
- `onnxruntime/quantization/onnx_quantizer.py`
- `def _adjust_weight_scale_for_int32_bias(...)`
- `def _requantize_weight(...)`
- Integrated in `quantize_bias_static(...)`
---
Please let me know if you'd like additional test coverage or integration
points. Thanks!