KleidiAI SGEMM/IGEMM/Quantized MatMul - Modular MLAS API Changes for KleidiAI (#25187)
This PR introduces the initial integration of KleidiAI-optimized
microkernels into ONNX Runtime's MLAS backend, focusing on support for:
- SGEMM
- IGEMM
- Dynamic Quantized MatMuls
Key changes:
- Implements overrides for MlasGemmBatch, MlasGemmPackBSize, and
  MlasGemmPackB using KleidiAI where applicable.
- Applies dispatch logic based on TransA == CblasNoTrans and SME2
  availability.
- Supports float32 and int8 GEMM workloads with conditionally invoked SME2
  paths.
- Maintains fallback paths to default MLAS implementations to ensure
  coverage and stability.
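
A rough sketch of the dispatch-with-fallback pattern described under "Key changes". This is not the code in this PR: `GemmArgs`, `GemmPath`, `TryKleidiGemm`, and `DispatchGemm` are hypothetical names for illustration, and SME2 availability is passed in as a plain flag rather than probed from the CPU. It only assumes what the description states: the KleidiAI path is taken when A is not transposed and SME2 is available, and the default MLAS path handles everything else.

```cpp
#include <cstddef>

// Hypothetical stand-ins for illustration only; these are NOT the
// actual MLAS or KleidiAI declarations.
enum class Transpose { NoTrans, Trans };  // mirrors CblasNoTrans / CblasTrans

struct GemmArgs {
    Transpose trans_a;    // whether A is transposed
    std::size_t m, n, k;  // problem dimensions
};

enum class GemmPath { KleidiAI, DefaultMlas };

// Returns true only when the (hypothetical) KleidiAI override accepts the call.
// Assumption: eligibility depends on TransA == NoTrans and SME2 being present.
bool TryKleidiGemm(const GemmArgs& args, bool sme2_available) {
    if (!sme2_available) {
        return false;  // no SME2: leave the call to the default implementation
    }
    if (args.trans_a != Transpose::NoTrans) {
        return false;  // transposed A is not handled by this sketch
    }
    // An optimized microkernel would be invoked here.
    return true;
}

// Dispatch point: try the override first, otherwise fall back.
GemmPath DispatchGemm(const GemmArgs& args, bool sme2_available) {
    if (TryKleidiGemm(args, sme2_available)) {
        return GemmPath::KleidiAI;
    }
    return GemmPath::DefaultMlas;  // fallback keeps full coverage
}
```

Having the override report whether it handled the call keeps the fallback decision in one place, so any shape or platform the override declines still reaches the default path.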
**Known Issues / Next Steps:**
Requesting feedback specifically on the API structure:
- Does the new MLAS interface design align with long-term extensibility?
- Are the dispatch points and override boundaries well-structured?
Indicative performance figures:
The added kernels are particularly effective for Conv2D operators:
* Measured with KleidiAI SME running mobilenet_v1_ssd_f32 on a Mac Mini M4,
single-threaded
<img width="815" height="308" alt="image"
src="https://github.com/user-attachments/assets/e39a7fef-1370-4332-83a3-1f3a80b29da4"
/>
---------
Signed-off-by: Damien Dooley <damien.dooley@arm.com>
Co-authored-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Co-authored-by: Declan Flavin <declan.flavin@arm.com>
Co-authored-by: Colm Donelan <colm.donelan@arm.com>
Co-authored-by: Damien Dooley <damdoo01@ip-10-249-28-46.eu-west-1.compute.internal>