[MLAS] Add 8-bit weights ARM64 Gemm implementation (#25110)
### Description
Enables 8-bit-weight Gemm on ARM64 via MLAS.
1. Supports two flavors of the 8-bit Gemm kernel: one uses `vdotq` (U8U8), and the other uses `vusdotq` (U8S8) on platforms where I8MM is supported.
2. Provides access to these new MLAS Gemm kernels via the `MatMulNBits`
contrib operator.
3. Tests:
**MLAS**
Three new sets of tests:
- `SQ8BitQuantA`: Tests the dynamic activation quantization MLAS kernel
(`fp32 -> uint8_t`, or `fp32 -> int8_t` on I8MM platforms)
- `SQ8BitPrepack`: Tests the prepacking of the weights for the 8-bit
Gemm kernels
- `SQ8BitGemm`: Tests the 8-bit Gemm kernels
**MatMulNBits contrib op tests**
- Enables the 8-bit Gemm tests on ARM64 (previously only enabled on x86)
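The arithmetic these kernels and tests exercise can be sketched as a small NumPy reference model: dynamically quantize a row of fp32 activations to `uint8`, accumulate the matmul in `int32` (the step `vdotq` performs four 8-bit lanes at a time), then dequantize with the combined scales. This is an illustrative sketch only, not MLAS code; the function names, per-column weight scales, and the asymmetric activation scheme are assumptions for the example.

```python
import numpy as np

def quantize_activations(a):
    # Hypothetical model of dynamic per-row asymmetric quantization,
    # fp32 -> uint8 (the path SQ8BitQuantA tests covers).
    amin = min(float(a.min()), 0.0)
    amax = max(float(a.max()), 0.0)
    scale = (amax - amin) / 255.0 or 1.0
    zero_point = int(round(-amin / scale))
    q = np.clip(np.round(a / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def q8_gemm_row(a_row, b_q, b_scale, b_zp):
    # b_q: uint8 weights of shape [K, N]; b_scale: per-column fp32 scales;
    # b_zp: weight zero point. Roughly models the U8U8 kernel flavor.
    a_q, a_scale, a_zp = quantize_activations(a_row)
    # Widen to int32, subtract zero points, and accumulate exactly.
    acc = (a_q.astype(np.int32) - a_zp) @ (b_q.astype(np.int32) - b_zp)
    # Dequantize: each output column picks up both scales.
    return acc * (a_scale * b_scale)
```

The result should approximate the fp32 matmul `a_row @ B` up to quantization error, which is what the `SQ8BitGemm` tests check against a reference implementation.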
### Motivation and Context
Enables 8-bit-weight Gemm on ARM64 via MLAS.
Based on work contributed by @fajin-corp.
Phi-4-mini-instruct perf numbers (before and after this change):
<img width="593" height="179" alt="image"
src="https://github.com/user-attachments/assets/d81b9059-b8db-407c-8c0f-527099f9358c"
/>
---------
Co-authored-by: Jing Fang <fajin@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>