Fix INT32 bias overflow in QOperator INT8 symmetric quantization by adjusting weight scale and requantizing (#25278)
### Overview
This PR introduces a critical fix for **QOperator INT8 symmetric
quantization** in ONNX Runtime. It addresses a situation where the
computed **bias scale** (`input_scale * weight_scale`) becomes too
small, leading to **int32 overflow** or **precision clipping** during
bias quantization.
### Problem
In symmetric quantization (i.e., zero_point = 0), the bias tensor is
quantized using a fixed-point scale:
**bias_scale = input_scale * weight_scale**
When this product is very small, the quantized bias values can far exceed
the `int32` range, causing saturation (and hence large quantization
error) once they are clamped back into range.
This was observed to cause **>51% accuracy loss** in some models.
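The failure mode can be reproduced in a few lines. This is a minimal illustrative sketch (plain NumPy, not ONNX Runtime code); the scale values are made-up examples chosen to trigger the overflow:

```python
import numpy as np

INT32_MIN, INT32_MAX = np.iinfo(np.int32).min, np.iinfo(np.int32).max

# Hypothetical scales: a tiny weight scale makes the bias scale tiny too.
input_scale = 1e-5
weight_scale = 1e-6
bias_scale = input_scale * weight_scale  # 1e-11

bias_float = np.array([0.5, -0.3], dtype=np.float32)

# Quantize: q = round(bias / scale), then clamp into the int32 range.
q_unclamped = np.round(bias_float / bias_scale)   # ~[5e10, -3e10]
q_bias = np.clip(q_unclamped, INT32_MIN, INT32_MAX).astype(np.int64)

# Dequantizing the saturated values gives a badly wrong bias.
bias_roundtrip = q_bias.astype(np.float64) * bias_scale
```

Here `q_unclamped` is on the order of 5e10, far beyond `INT32_MAX` (~2.1e9), so the clamped bias dequantizes to roughly 0.02 instead of 0.5.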
### Solution
This PR adds two new functions to mitigate this:
---
#### 🔧 `_adjust_weight_scale_for_int32_bias(...)`
Located in `onnx_quantizer.py`, this function:
- **Inspects the float bias range** to compute the smallest valid bias
scale (based on int32 dynamic range)
- **Compares** this threshold against `input_scale * weight_scale`
- If it is too small, **scales up the weight scale** just enough that the
resulting bias scale keeps the quantized bias within the int32 range
- Supports both per-tensor and per-channel weight quantization cases
This logic is **only triggered when**:
- The weight's zero point is exactly zero (i.e. symmetric)
- The weight data type is `INT8` or `INT16`
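The core of the adjustment can be sketched as follows. This is a simplified per-tensor illustration of the idea, not the actual `_adjust_weight_scale_for_int32_bias` implementation (the function name and signature here are illustrative):

```python
import numpy as np

def adjust_weight_scale_for_int32_bias(input_scale, weight_scale, bias_float):
    """Illustrative sketch: widen weight_scale so that
    input_scale * weight_scale can represent the bias in int32."""
    int32_max = np.iinfo(np.int32).max
    # Smallest bias scale that keeps every quantized bias inside int32.
    min_bias_scale = np.max(np.abs(bias_float)) / int32_max
    bias_scale = input_scale * weight_scale
    if bias_scale < min_bias_scale:
        # input_scale is fixed by the activations, so grow weight_scale.
        weight_scale = weight_scale * (min_bias_scale / bias_scale)
    return weight_scale

# Hypothetical values: the tiny weight scale gets widened.
adjusted = adjust_weight_scale_for_int32_bias(
    1e-5, 1e-6, np.array([0.5, -0.3], dtype=np.float32))
```

For per-channel quantization the same check would run independently for each output channel's weight scale, using that channel's bias value.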
---
#### 🔄 `_requantize_weight(...)`
After weight scale adjustment, this function:
- **Finds the original quantized weight** (`q_weight`), scale, and zero
point from the initializer list
- **Removes** the outdated quantized weight and scale
- **Re-quantizes** the original float weights using the new scale and
the same zero point
- **Re-inserts** them into the model to maintain consistency
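The re-quantization step itself is standard symmetric quantization with the new scale. A minimal sketch (illustrative names, not the actual `_requantize_weight` code, which also handles initializer bookkeeping in the model graph):

```python
import numpy as np

def requantize_weight(weight_float, new_scale, zero_point=0,
                      qmin=-127, qmax=127):
    """Illustrative sketch: re-quantize float weights with the adjusted
    scale, keeping the original (symmetric, zero) zero point."""
    q = np.round(weight_float / new_scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

# Example: weights re-quantized with an adjusted (larger) scale.
q_weight = requantize_weight(
    np.array([0.1, -0.2], dtype=np.float32), new_scale=0.01)
```

In the actual PR this result replaces the stale `q_weight` initializer in the model, so the graph stays consistent with the adjusted scale.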
---
### Summary of Benefits
- ✅ Prevents int32 overflow or saturation during symmetric bias
quantization
- ✅ Ensures weight and bias quantization remain consistent
- ✅ Reduced quantization error from >51.4% to ~3% in test models
- ✅ Fix is limited in scope to QOperator + symmetric INT8/INT16 flow
(safe for other modes)
- ✅ Improves robustness of static quantization for hardware that
performs integer-only inference
---
### Code Location
- `onnxruntime/quantization/onnx_quantizer.py`
- `def _adjust_weight_scale_for_int32_bias(...)`
- `def _requantize_weight(...)`
- Integrated in `quantize_bias_static(...)`
---
Please let me know if you'd like additional test coverage or integration
points. Thanks!