onnxruntime
93ee7bf0 - [QNN EP] MatMul+Add->Gemm fusion when AttentionFusion isn't enabled (#25017)

### Description

MatMul+Add->Gemm fusion when AttentionFusion isn't enabled.

### Motivation and Context

The graph transformation [MatMulAddFusion](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/optimizer/matmul_add_fusion.cc) folds an `ONNX::MatMul` followed by an `ONNX::Add` into an `ONNX::Gemm`; however, it [intentionally skips the portion that belongs to the "Attention Pattern"](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/optimizer/matmul_add_fusion.cc#L21). This results in poor performance on the QNN EP (and on other EPs that do not run the `*AttentionFusion` transformers), because the MatMul + Add pairs are left unfused.

![image](https://github.com/user-attachments/assets/cad0b2c6-ab07-4ced-a647-396c04fed365)

With this change, the additional Gemm fusion is applied *after* the AttentionFusion transformers have run.
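For reference, here is a minimal sketch (not the optimizer's C++ pass itself) of the equivalence the fusion relies on: a `MatMul` whose output feeds an `Add` of a bias computes the same result as a single `Gemm` with `alpha = beta = 1` and no transposes. The tensor names and shapes are illustrative, and the bias is passed as a plain graph input to keep the sketch short, whereas the real pass requires it to be an initializer.

```python
import numpy as np
from onnx import TensorProto, helper
import onnxruntime as ort

def make_model(nodes):
    # Tiny graph: X [4, 8] times W [8, 16] produces Y [4, 16];
    # B [16] is the bias that Add (or Gemm's C input) broadcasts.
    graph = helper.make_graph(
        nodes,
        "matmul_add_vs_gemm",
        inputs=[
            helper.make_tensor_value_info("X", TensorProto.FLOAT, [4, 8]),
            helper.make_tensor_value_info("W", TensorProto.FLOAT, [8, 16]),
            helper.make_tensor_value_info("B", TensorProto.FLOAT, [16]),
        ],
        outputs=[helper.make_tensor_value_info("Y", TensorProto.FLOAT, [4, 16])],
    )
    return helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])

# Unfused pattern: MatMul followed by Add.
unfused = make_model([
    helper.make_node("MatMul", ["X", "W"], ["mm_out"]),
    helper.make_node("Add", ["mm_out", "B"], ["Y"]),
])

# Fused form: a single Gemm; its C input broadcasts the bias.
fused = make_model([
    helper.make_node("Gemm", ["X", "W", "B"], ["Y"], alpha=1.0, beta=1.0),
])

feeds = {
    "X": np.random.rand(4, 8).astype(np.float32),
    "W": np.random.rand(8, 16).astype(np.float32),
    "B": np.random.rand(16).astype(np.float32),
}
y_unfused = ort.InferenceSession(unfused.SerializeToString()).run(None, feeds)[0]
y_fused = ort.InferenceSession(fused.SerializeToString()).run(None, feeds)[0]
assert np.allclose(y_unfused, y_fused, atol=1e-5)
```

Folding the pair into one Gemm lets an EP dispatch a single fully-connected op instead of two separate kernels, which is where the performance win on the QNN EP comes from.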