feat(quantization): add calibration cache to quantize_static (#28221)
## Summary
- Add an optional `calibration_cache_path` parameter to
`quantize_static()` so users can save and reload the calibration result
(`TensorsData`) across runs.
- Avoids re-running the expensive calibration inference pass when
iterating on post-calibration options such as `nodes_to_exclude`,
`activation_type`, or `weight_type`.
- Cache format is JSON, mirroring the encoder already used by
`write_calibration_table` — no new serialization surface area.
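A minimal usage sketch of the workflow this enables, assuming a build of `onnxruntime` that includes this PR. The helper name `sweep_exclusions`, the output naming scheme, and the default cache file name are hypothetical; only `calibration_cache_path` comes from this change, and the remaining keyword arguments are existing `quantize_static` parameters.

```python
from pathlib import Path


def sweep_exclusions(model_path, reader, exclusion_sets, cache_path="calib_cache.json"):
    """Quantize one model several times, calibrating only on the first call."""
    # Imported lazily so the sketch can be imported without onnxruntime installed.
    from onnxruntime.quantization import CalibrationMethod, QuantType, quantize_static

    src = Path(model_path)
    outputs = []
    for i, excluded in enumerate(exclusion_sets):
        out_path = src.with_name(f"{src.stem}.q{i}.onnx")  # hypothetical naming
        quantize_static(
            src,
            out_path,
            reader,
            nodes_to_exclude=list(excluded),
            activation_type=QuantType.QInt8,
            weight_type=QuantType.QInt8,
            calibrate_method=CalibrationMethod.MinMax,
            # Parameter added by this PR: the first iteration is expected to
            # write the cache, later iterations to load it and skip calibration.
            calibration_cache_path=cache_path,
        )
        outputs.append(out_path)
    return outputs
```

Because later iterations are served from the cache, the data reader only needs to yield its samples once.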
## Motivation
Fixes #21908. Users commonly re-run `quantize_static` multiple times on
the same model and calibration dataset while varying the set of excluded
nodes or the quant types, to trade off accuracy vs. speed. Today, every
call repeats the full calibration inference loop even though the
calibration result is identical, which is costly on large calibration
datasets. There is no supported way to persist the computed tensor
ranges: `write_calibration_table` writes a lossy table (it drops histogram
data) and has no paired reader. This PR closes that gap.
## Changes
- `python/tools/quantization/calibrate.py`:
  - Add `TensorData.from_dict` and `TensorsData.from_dict` classmethods
    (inverse of existing `to_dict`).
  - Add module-level `_CalibrationCacheEncoder(json.JSONEncoder)`,
    `save_tensors_data(tensors, path)`, and `load_tensors_data(path)`. The
    encoder handles `TensorData`, `TensorsData`, `np.ndarray`,
    `CalibrationMethod`, and numpy scalars. Writes are atomic (tmp file +
    `os.replace`) and auto-create parent directories.
- `python/tools/quantization/quantize.py`:
  - `quantize_static` gains
    `calibration_cache_path: str | Path | None = None`. If the path exists,
    calibration is skipped and ranges are loaded from the cache. If the path
    does not exist yet, calibration runs and the result is saved there.
    Raises `ValueError` if the cached `calibration_method` does not match
    the caller's `calibrate_method`.
  - `calibration_data_reader` becomes optional; at least one of the reader
    or an existing cache must be provided, otherwise `ValueError` is raised.
- `python/tools/quantization/__init__.py`: export `TensorData`,
`TensorsData`, `save_tensors_data`, `load_tensors_data`.
- Tests: new `TestCalibrationCache` in
`test/python/quantization/test_calibration.py` covering MinMax
roundtrip, Entropy roundtrip (with histogram), missing-path error,
parent-dir auto-creation, numpy scalar `bins` handling, method-mismatch
guard, end-to-end `quantize_static` cache hit/miss, and `ValueError`
when neither reader nor cache is provided.
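The save/load pair described above can be sketched with the standard library alone. This is an illustration of the pattern (custom `json.JSONEncoder`, temp file plus `os.replace`, parent-directory creation), not the code in `calibrate.py`: `Method`, `CacheEncoder`, `save_cache`, `load_cache`, the `__enum__` tag, and the choice of `FileNotFoundError` for a missing path are all hypothetical stand-ins, and numpy handling is left out.

```python
import json
import os
import tempfile
from enum import Enum
from pathlib import Path


class Method(Enum):  # hypothetical stand-in for CalibrationMethod
    MinMax = 0
    Entropy = 1


class CacheEncoder(json.JSONEncoder):
    """Hypothetical encoder: serializes enum members as tagged dicts."""

    def default(self, o):
        if isinstance(o, Method):
            return {"__enum__": o.name}
        return super().default(o)


def _decode(d):
    # Object hook that reverses the enum tagging done by CacheEncoder.
    return Method[d["__enum__"]] if "__enum__" in d else d


def save_cache(data, path):
    """Write JSON to a temp file next to the target, then rename it into place."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f, cls=CacheEncoder)
        os.replace(tmp, path)  # readers see the old file or the new one, never a partial write
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)  # do not leave a stray temp file behind on failure
        raise


def load_cache(path):
    """Read a cache written by save_cache; a missing file is an error (assumed type)."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"calibration cache not found: {path}")
    with path.open() as f:
        return json.load(f, object_hook=_decode)
```

Creating the temp file in the destination directory keeps the final rename on a single filesystem, which is what lets `os.replace` swap the file in one step.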
## Test Plan
- `python -m pytest onnxruntime/test/python/quantization/test_calibration.py::TestCalibrationCache -v`
- `python -m pytest onnxruntime/test/python/quantization/test_calibration.py::TestCalibrateMinMaxCalibrator -v` (regression)
- `lintrunner -a` on changed files: clean.
## Backward Compatibility
`calibration_data_reader` changes from a required parameter to an
optional one. Existing call sites, whether they pass it positionally or by
keyword, continue to work unchanged. The new behavior is engaged only when
`calibration_cache_path` is provided.
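The resulting control flow, including the backward-compatible path, can be sketched as follows. This is a simplified model under stated assumptions, not the code in `quantize.py`: `resolve_ranges`, the injected `run_calibration` callable, the plain-dict cache layout, and `demo` are hypothetical, and the calibration method is represented as a string.

```python
import json
import tempfile
from pathlib import Path


def resolve_ranges(reader=None, calibrate_method="MinMax", cache_path=None, *, run_calibration):
    """Return calibration results, consulting the cache before calibrating."""
    cache = Path(cache_path) if cache_path is not None else None

    # Cache hit: skip calibration, but refuse a cache built with another method.
    if cache is not None and cache.exists():
        cached = json.loads(cache.read_text())
        if cached["calibration_method"] != calibrate_method:
            raise ValueError(
                f"cache was built with {cached['calibration_method']}, "
                f"but calibrate_method is {calibrate_method}"
            )
        return cached

    # No usable cache: a reader is now mandatory.
    if reader is None:
        raise ValueError("need calibration_data_reader or an existing calibration cache")

    result = {"calibration_method": calibrate_method, "ranges": run_calibration(reader)}
    if cache is not None:  # cache miss: persist for the next run
        cache.write_text(json.dumps(result))
    return result


def demo():
    """Count calibration passes across a cache miss followed by a cache hit."""
    calls = []

    def fake_calibration(reader):
        calls.append(reader)
        return {"conv1": [min(reader), max(reader)]}

    with tempfile.TemporaryDirectory() as d:
        cache = Path(d) / "calib.json"
        first = resolve_ranges([0.5, -1.0, 2.0], cache_path=cache, run_calibration=fake_calibration)
        # Second call passes no reader at all and is served from the cache.
        second = resolve_ranges(cache_path=cache, run_calibration=fake_calibration)
    assert first == second
    return len(calls)
```

When `cache_path` is left as `None`, neither cache branch runs, so a call that passes only the reader behaves as it did before the change.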