[LARCH64 CPU] Inference acceleration for 4-bit quantized models on Loongson CPUs (#26280)
### Description
This PR adds a 4-bit quantized matrix multiplication operator for the
Loongson platform. It passes ONNX Runtime's internal test checks and has
been successfully deployed for real inference on Loongson hardware. It
consists of five changes:
(1) **sqnbitgemm_kernel_lasx.cpp**: accelerates inference for 4-bit
quantized models on the LoongArch64 architecture using the LASX/LSX
vector instruction sets;
(2) **sqnbitgemm_kernel_lasx_common.h**: implements auxiliary functions
used by **sqnbitgemm_kernel_lasx.cpp**;
(3) **cmake**: adds compilation options for
**sqnbitgemm_kernel_lasx.cpp** on the LoongArch64 architecture;
(4) **mlasi.h**: adds the interface for calling the operator in
**sqnbitgemm_kernel_lasx.cpp** on the LoongArch64 architecture;
(5) **platform.cpp**: wires up the calls to the operators in
**sqnbitgemm_kernel_lasx.cpp** on the LoongArch64 architecture.
### Motivation and Context
The Loongson platform lacks key optimized operators for ONNX quantized
model inference, resulting in poor inference performance for 4-bit
quantized models. This PR addresses that gap: in tests with the
DeepSeek-R1-1.5B model, our operators increased TPS by more than 7x, and
dequantization of the quantized matrices sped up by as much as 3x.
### Pictures
Dequantization acceleration:
In the chart, the vertical axis is time in milliseconds (ms), the
horizontal axis is the index of the test matrix, and each quantized
matrix size is given as rows × columns (e.g., 1536 × 256).
<img width="4039" height="831" alt="Dequantization acceleration"
src="https://github.com/user-attachments/assets/26da1ed9-79ae-4abd-9e6d-cadaea9ee013"
/>
---------
Co-authored-by: 全都做不队 <t202410611994336@eduxiji.net>