onnxruntime
6c522b4d - Fix FP8 (FLOAT8E4M3FN) quantization scale using wrong reference distribution (#29350)

Commit

2 days ago

Fix FP8 (FLOAT8E4M3FN) quantization scale using wrong reference distribution (#29350)

## Problem

`compute_scale_zp_float8` (in `onnxruntime/python/tools/quantization/quant_utils.py`) computes the FP8 quantization scale as `scale = std_data / std_f8`, where `std_f8` is the standard deviation of the representable `FLOAT8E4M3FN` values. It built that reference distribution as:

```python
all_values = [float(i) for i in range(256)]
```

That's the integers `0.0 .. 255.0`, **not** the float8 values. It should reinterpret each of the 256 byte patterns as a `float8_e4m3fn` value (the finite set spanning `-448..448`).

This is a regression from the ONNX 1.19 integration that removed `onnx.numpy_helper.float8e4m3_to_float32` (the prior code was `[float8e4m3_to_float32(i) for i in range(256)]`); the repo's own reference notebook `docs/python/notebooks/quantization_f8.ipynb` still documents the correct algorithm.

Effect: `std_f8` is computed as **73.90** instead of **100.06**, so every FP8 scale is **~35% too large**, degrading FP8-quantized model accuracy. The path is live: it is called from `onnx_quantizer.py` and `qdq_quantizer.py`.

## Reproduction (real function)

```python
compute_scale_zp_float8(TensorProto.FLOAT8E4M3FN, numpy.float32(1.0))
# before: scale = 0.01353175 (distribution = 0..255, n=256, std=73.90)
# after: scale = 0.00999423 (distribution = -448..448, n=254, std=100.06)
```

## Fix

```python
all_values = numpy.arange(256, dtype=numpy.uint8).view(float8_e4m3fn).astype(numpy.float32)
```

The existing `not numpy.isnan(f) and not numpy.isinf(f)` filter then drops the 2 NaN byte patterns, leaving the 254 finite float8 values. `float8_e4m3fn` and `numpy` are already imported.

## Test

Adds `test_compute_scale_zp_float8` to `onnxruntime/test/python/quantization/test_quant_util.py` asserting `scale == std / 100.0577` (and linearity in `std`). It fails on the old code (`std_f8` 73.9) and passes after the fix.
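The two standard deviations quoted in this message can be checked independently of the `float8_e4m3fn` dtype. The sketch below is not the code from this commit: `decode_e4m3fn` is a hypothetical helper that decodes each byte by hand, assuming the standard FLOAT8E4M3FN layout (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, and only `0x7F` and `0xFF` encoding NaN). It uses NumPy's default population standard deviation, which is an assumption about how `std_f8` is computed.

```python
import numpy as np


def decode_e4m3fn(byte: int) -> float:
    """Decode one FLOAT8E4M3FN byte pattern (hypothetical helper, not ORT code)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F  # 4 exponent bits, bias 7
    mantissa = byte & 0x07         # 3 mantissa bits
    if exponent == 0x0F and mantissa == 0x07:
        return float("nan")        # the only NaN patterns; the format has no infinities
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6  # subnormal values
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


# Buggy reference distribution: the byte indices themselves, 0.0 .. 255.0.
wrong_values = np.array([float(i) for i in range(256)])

# Intended reference distribution: every byte pattern decoded, NaNs dropped.
decoded = np.array([decode_e4m3fn(i) for i in range(256)])
finite_values = decoded[np.isfinite(decoded)]

std_wrong = float(np.std(wrong_values))
std_f8 = float(np.std(finite_values))

print(f"byte indices:   n={wrong_values.size}, std={std_wrong:.2f}")
print(f"decoded float8: n={finite_values.size}, std={std_f8:.4f}")
print(f"old scale / new scale = {std_f8 / std_wrong:.3f}")
```

Under these assumptions the decoded set has 254 finite values spanning `-448..448`, and the two standard deviations land on the 73.90 and 100.06 figures above, which is where the roughly 35% inflation of the scale comes from.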

References

#29350 - Fix FP8 (FLOAT8E4M3FN) quantization scale using wrong reference distribution

Author

Osamaali313

Parents

c3a5222d
