onnxruntime
91e91186 - [QNN EP] Add LowPowerBlockQuantization support for Gemm node (#25458)

Committed 183 days ago
[QNN EP] Add LowPowerBlockQuantization support for Gemm node (#25458)

### Description
- Low Power Block Quantization (LPBQ) is widely used to accelerate accuracy-sensitive models via the QNN (Qualcomm Neural Network) stack.
- The LPBQ encoding format is Qualcomm's alternative to the plain block quantization technique.
- The current implementation expects LPBQ encodings packed in a node sequence (DQ -> Q -> DQ).
- This PR folds the LPBQ pattern on the weight of Gemm nodes into a QNN BlockExpansion encoding structure.
- This PR also adds INT4 quantization support.

### Motivation and Context
- This enables acceleration, via the QNN EP, of accuracy-sensitive models that require block-quantization-style encodings.
- This avoids falling back nodes that consume block-quantized tensors to the CPU EP, further improving inference time.
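To make the folding concrete, here is a hedged numpy sketch of the LPBQ arithmetic, assuming the common two-level factorization in which each per-block scale is an integer multiple of a single per-channel float scale. All names, shapes, and the block size are illustrative, not the PR's actual implementation; the point is only that the "block expansion" view (expand INT4 weights by an integer per-block scale, then apply one per-channel float scale) reproduces ordinary per-block dequantization.

```python
import numpy as np

# Illustrative shapes: an 8x3 Gemm weight, quantized in blocks of 4 rows.
rng = np.random.default_rng(0)
block_size, rows, cols = 4, 8, 3
n_blocks = rows // block_size

w_int4 = rng.integers(-8, 8, size=(rows, cols)).astype(np.int8)            # INT4 range [-8, 7]
channel_scale = rng.uniform(0.01, 0.05, size=(1, cols)).astype(np.float32)  # per-channel float scale
int_block_scale = rng.integers(1, 16, size=(n_blocks, cols)).astype(np.int32)  # per-block integer scale

# Plain block-quantization view: one float scale per (block, channel),
# here constructed as channel_scale * integer multiplier.
block_scale = channel_scale * int_block_scale.astype(np.float32)
w_blockwise = w_int4.astype(np.float32) * np.repeat(block_scale, block_size, axis=0)

# LPBQ "block expansion" view: expand the INT4 weight by the integer
# per-block scale first (still integer arithmetic), then apply the single
# per-channel float scale.
w_expanded = w_int4.astype(np.int32) * np.repeat(int_block_scale, block_size, axis=0)
w_lpbq = w_expanded.astype(np.float32) * channel_scale

assert np.allclose(w_blockwise, w_lpbq)
print("LPBQ factorization matches plain block dequantization")
```

Because the per-block multipliers stay integer, the expensive part of the dequantization can remain in integer arithmetic with only one float scale per channel, which is what makes this encoding attractive on low-power accelerators.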