[LARCH64 CPU] Inference acceleration for 4-bit quantized models on Loongson CPUs (#26280)
### Description
This PR adds a 4-bit quantized matrix multiplication operator for the
Loongson platform. It passes ONNX Runtime's internal test checks and has
been successfully deployed for real inference on Loongson hardware. It
consists of five changes:
(1) **sqnbitgemm_kernel_lasx.cpp**: accelerates inference for 4-bit
quantized models on the LoongArch64 architecture using the LASX/LSX
vector instruction sets;
(2) **sqnbitgemm_kernel_lasx_common.h**: implements auxiliary functions
used by **sqnbitgemm_kernel_lasx.cpp**;
(3) **cmake**: adds compilation options for
**sqnbitgemm_kernel_lasx.cpp** on the LoongArch64 architecture;
(4) **mlasi.h**: adds the interface for calling the operator in
**sqnbitgemm_kernel_lasx.cpp** on the LoongArch64 architecture;
(5) **platform.cpp**: wires up the calls to the operators in
**sqnbitgemm_kernel_lasx.cpp** on the LoongArch64 architecture.
### Motivation and Context
The Loongson platform lacks key optimized operators for ONNX quantized
model inference, resulting in poor inference performance for 4-bit
quantized models. This PR addresses that gap: in tests with the
DeepSeek-R1-1.5B model, our operators increased TPS by more than 7x, and
dequantization of the quantized matrices sped up by as much as 3x.
### Pictures
Dequantization acceleration:
In the chart, the vertical axis is time in milliseconds (ms), the
horizontal axis is the index of the test matrix, and each quantized
matrix size is given as rows × columns (e.g., 1536 × 256).
<img width="4039" height="831" alt="Dequantization acceleration"
src="https://github.com/user-attachments/assets/26da1ed9-79ae-4abd-9e6d-cadaea9ee013"
/>
---------
Co-authored-by: 全都做不队 <t202410611994336@eduxiji.net>