[Quant tool] Improve performance of int4 weight quantization (#20935)
### Description
- Uses our own quantization functions instead of the ONNX reference
implementation of QuantizeLinear when quantizing weights to int4.
- Uses a custom function that packs 4-bit elements into bytes.
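
The packing step can be sketched as below. This is an illustrative NumPy sketch, not the PR's actual code; the nibble order (even index in the low nibble) and the zero-padding of odd-length inputs are assumptions:

```python
import numpy as np

def pack_int4(vals: np.ndarray) -> np.ndarray:
    """Pack signed int4 values (range [-8, 7]) into bytes, two per byte.

    Hypothetical sketch: assumes the element at the even index goes into
    the low nibble, which may differ from the tool's actual convention.
    """
    flat = vals.astype(np.int8).flatten()
    if flat.size % 2:
        # Pad odd-length input so every byte holds two nibbles.
        flat = np.append(flat, np.int8(0))
    # Mask each value to its low 4 bits, then combine adjacent pairs.
    nib = flat.view(np.uint8) & 0x0F
    return (nib[0::2] | (nib[1::2] << 4)).astype(np.uint8)
```

Doing this with vectorized NumPy operations instead of a per-element Python loop is what makes the packing fast for multi-million-parameter weight tensors.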
### Motivation and Context
Running the quantization tool to create QDQ models with int4 weights was slow: the previous implementation could take up to 7x longer than the one in this PR. This PR uses our own quantization and byte-packing utilities to improve performance.
#### Measurements
Quantizing a model with ~5M parameters to int4:
- Current implementation: **84.5s**
- Only replacing the ONNX QuantizeLinear implementation: **50.3s** (1.68x speedup)
- This PR (replace ONNX QuantizeLinear impl + custom packing function): **13.5s** (6.26x speedup)
---------
Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>