onnxruntime
ba11af41 - [QNN-EP] Add MatMulNBits translation for GPU (#26340)

### Description

Add support for translating the MatMulNBits contrib op to the QNN FullyConnected operation with INT4 block-quantized weights.

Implementation details:
- Translate MatMulNBits to FullyConnected in the OpBuilder
- Support QNN_QUANTIZATION_ENCODING_BLOCK for INT4 weights
- Pass the INT4 weights and quantization parameters to QNN as BlockQuantization encoding params

Testing:
- Added new unit tests for the MatMulNBits -> QNN-GPU path
- Validated all OnnxRuntime tests
- Validated the following LLMs through the Olive and ORT-GenAI execution flow:
  - LlaMA3.2 1B
  - Qwen2.5
  - DeepSeek-R1-Qwen 1.5b
  - Phi3.5-mini-instruct

### Motivation and Context

LLMs quantized to INT4 by the Olive quantization pass produce models containing MatMulNBits contrib ops. To run these ops via QNN-EP, MatMulNBits is translated to the QNN FullyConnected op with INT4 weights.

---------

Co-authored-by: tirupath-qti <tirupath@qti.qualcomm.com>
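To illustrate what "INT4 block-quantized weights" means here, below is a minimal NumPy sketch of block dequantization. It is not ORT or QNN code; the helper name, the nibble order (low nibble first), and the per-block unsigned zero points are assumptions for illustration, and the actual MatMulNBits packing layout may differ.

```python
import numpy as np

def dequantize_int4_blocks(packed, scales, zero_points, K, N, block_size):
    """Hypothetical helper: expand block-quantized INT4 weights to float.

    packed:      uint8 array of shape (N, K // 2), two 4-bit values per byte
    scales:      float array of shape (N, K // block_size), one scale per block
    zero_points: uint8 array of the same shape as scales
    Returns the dequantized weight matrix W of shape (N, K).
    """
    # Unpack two 4-bit values per byte (assuming low nibble comes first).
    low = packed & 0x0F
    high = packed >> 4
    q = np.empty((N, K), dtype=np.int32)
    q[:, 0::2] = low
    q[:, 1::2] = high
    # Broadcast each block's scale and zero point across its block_size columns.
    s = np.repeat(scales, block_size, axis=1)
    z = np.repeat(zero_points.astype(np.int32), block_size, axis=1)
    return (q - z) * s

# Tiny example: N=2 output rows, K=4, block_size=2
packed = np.array([[0x21, 0x43], [0x65, 0x87]], dtype=np.uint8)
scales = np.array([[1.0, 2.0], [0.5, 0.5]])
zero_points = np.array([[0, 0], [8, 8]], dtype=np.uint8)
W = dequantize_int4_blocks(packed, scales, zero_points, K=4, N=2, block_size=2)
# FullyConnected then computes y = x @ W.T (+ bias) on the dequantized weights.
```

Each weight reconstructs as `(q - zero_point) * scale`, with one scale/zero-point pair per block of `block_size` consecutive values along K, which is what a block-quantization encoding carries to the backend.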