x86/64 U8S8 Gemm Precision Fix (#12088)

Commit

3 years ago

Add a graph optimization that converts u8s8 matrix multiplication to u8u8 when needed.

On x86/64 platforms, specifically CPUs with SSE4.1, AVX2 or AVX512, u8s8 matrix multiplications run with better performance. Unfortunately, the higher performance comes with value overflow problems, as described in: https://www.intel.com/content/www/us/en/develop/documentation/onednn-developer-guide-and-reference/top/advanced-topics/nuances-of-int8-computations.html

This change adds a session option "session.x64quantprecision" (off by default). For operators that call u8s8 matrix multiplications, e.g. QAttention, we convert them to u8u8 when all of the following conditions are satisfied:

1. The current CPU supports SSE4.1, AVX2 or AVX512 and has no VNNI support.
2. The session option "session.x64quantprecision" is on.
3. The constant weight tensor contains values outside the [-64, 63] range.

Note that when the weight tensor is not constant, QDQS8ToU8Transformer should already have converted it to u8.
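The overflow and the weight-range check can be illustrated with a small pure-Python model. This is only a sketch, not the code from this commit: it assumes the overflow comes from a pairwise u8*s8 multiply-add whose intermediate sum saturates to int16 (the behavior the linked Intel page describes), and all function names here are hypothetical. It also assumes the s8 to u8 conversion is a shift by 128 applied to both the weights and their zero point.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def saturate_int16(value):
    """Clamp an integer to the int16 range."""
    return max(INT16_MIN, min(INT16_MAX, value))


def pairwise_u8s8_dot(a_u8, b_s8):
    """Model a u8*s8 dot product where each adjacent pair of products
    is summed into a saturating int16 before 32-bit accumulation."""
    assert len(a_u8) == len(b_s8) and len(a_u8) % 2 == 0
    total = 0
    for i in range(0, len(a_u8), 2):
        pair = a_u8[i] * b_s8[i] + a_u8[i + 1] * b_s8[i + 1]
        total += saturate_int16(pair)
    return total


def exact_dot(a, b):
    """Reference dot product without any saturation."""
    return sum(x * y for x, y in zip(a, b))


def weights_need_u8u8(weights_s8, lo=-64, hi=63):
    """Condition 3 above: any constant weight outside [lo, hi]."""
    return any(w < lo or w > hi for w in weights_s8)


def s8_to_u8(weights_s8, zero_point_s8):
    """Hypothetical conversion: shifting weights and zero point by 128
    keeps (w - zero_point) unchanged while moving storage to u8."""
    return [w + 128 for w in weights_s8], zero_point_s8 + 128


activations = [255, 255]
large_weights = [127, 127]  # outside [-64, 63]
small_weights = [63, 63]    # inside [-64, 63]

# Large weights: the pair sum 64770 does not fit in int16 and is clamped.
print(pairwise_u8s8_dot(activations, large_weights), exact_dot(activations, large_weights))
# Small weights: the pair sum 32130 fits, so both results agree.
print(pairwise_u8s8_dot(activations, small_weights), exact_dot(activations, small_weights))
```

Under this model, weights inside [-64, 63] can never push a pair sum past the int16 limits, because 2 * 255 * 64 = 32640, which is consistent with condition 3 skipping the conversion for such tensors.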

References

#12088 - x86/64 U8S8 Gemm Precision Fix

Author

chenfucn

Parents

48647bc7

onnxruntime 040c2f45 - x86/64 U8S8 Gemm Precision Fix (#12088)
