[MLAS][KleidiAI]Catlaw01/sgemm epilogue neon opt (#27609)
### Description
This change updates the KleidiAI SGEMM post-processing path in
onnxruntime/core/mlas/lib/kleidiai/sgemm_kleidiai.cpp with two parts:
- Correctness fix: in the alpha == 0 || K == 0 fast path, beta handling
is now applied for every batch entry (not just batch 0), so batched
SGEMM behaviour is correct.
- NEON SGEMM epilogue optimisation: adds a vectorised alpha/beta
post-processing path for contiguous outputs, with a guarded fallback to
the scalar path for non-contiguous or small outputs. The 2D epilogue
also routes contiguous tiles through the contiguous 1D epilogue so that
they are vectorised.
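
As a rough illustration of the epilogue idea (not the code in sgemm_kleidiai.cpp), the sketch below applies C = alpha * (A*B) + beta * C over a contiguous run with NEON where available, and falls back to scalar otherwise. The names `EpilogueContiguous` and `Epilogue2D`, their parameters, and the `ab` buffer holding the A*B product are all assumptions made for this example. The beta == 0 special case (not reading C) is omitted for brevity.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: c[i] = alpha * ab[i] + beta * c[i] for n contiguous floats.
inline void EpilogueContiguous(const float* ab, float* c, std::size_t n,
                               float alpha, float beta) {
    std::size_t i = 0;
#if defined(__ARM_NEON)
    const float32x4_t valpha = vdupq_n_f32(alpha);
    const float32x4_t vbeta = vdupq_n_f32(beta);
    // Process four floats per iteration.
    for (; i + 4 <= n; i += 4) {
        float32x4_t vab = vld1q_f32(ab + i);
        float32x4_t vc = vld1q_f32(c + i);
        float32x4_t r = vmulq_f32(vab, valpha);  // alpha * ab
        r = vmlaq_f32(r, vc, vbeta);             // + beta * c
        vst1q_f32(c + i, r);
    }
#endif
    // Scalar tail; also the whole path on non-NEON builds and for n < 4.
    for (; i < n; ++i) {
        c[i] = alpha * ab[i] + beta * c[i];
    }
}

// Hypothetical 2D wrapper: an m x n tile with row strides ldab and ldc.
inline void Epilogue2D(const float* ab, std::size_t ldab,
                       float* c, std::size_t ldc,
                       std::size_t m, std::size_t n,
                       float alpha, float beta) {
    assert(ldab >= n && ldc >= n);
    if (ldab == n && ldc == n) {
        // Rows are back to back, so treat the tile as one 1D run.
        EpilogueContiguous(ab, c, m * n, alpha, beta);
        return;
    }
    // Non-contiguous tile: scalar fallback.
    for (std::size_t r = 0; r < m; ++r) {
        for (std::size_t j = 0; j < n; ++j) {
            c[r * ldc + j] = alpha * ab[r * ldab + j] + beta * c[r * ldc + j];
        }
    }
}
```

Collapsing a contiguous tile into a single 1D run keeps the vector loop long and leaves at most one scalar tail per tile rather than one per row.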
### Motivation and Context
This change addresses correctness and performance in the SGEMM
post-processing stage:
- The batched alpha == 0 || K == 0 path previously used only Data[0],
which could produce incorrect results for BatchSize > 1.
- The post-processing loop (C = alpha * (A*B) + beta * C) is a known
latency contributor when memcpy fast paths are not applicable. The NEON
epilogue changes are intended to reduce this cost on supported ARM
platforms while preserving existing fallback behaviour.
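
The batched fast-path issue in the first item above can be pictured with the minimal sketch below. When alpha == 0 or K == 0 the product term vanishes, so each output reduces to C = beta * C, and that scaling has to run once per batch entry. `BatchEntry` and `ApplyBetaToBatch` are hypothetical names for this example and only loosely mirror the per-batch `Data[]` array mentioned in the description.

```cpp
#include <cstddef>

// Hypothetical per-batch parameters (illustrative only).
struct BatchEntry {
    float* C;         // M x N output matrix
    std::size_t ldc;  // row stride of C, in elements
    float beta;       // scale applied to the existing C
};

// Applies C = beta * C to every entry in the batch, not just Data[0].
inline void ApplyBetaToBatch(const BatchEntry* Data, std::size_t BatchSize,
                             std::size_t M, std::size_t N) {
    for (std::size_t b = 0; b < BatchSize; ++b) {
        const BatchEntry& e = Data[b];
        for (std::size_t m = 0; m < M; ++m) {
            float* row = e.C + m * e.ldc;
            for (std::size_t n = 0; n < N; ++n) {
                // beta == 0 overwrites C without reading it, as in BLAS.
                row[n] = (e.beta == 0.0f) ? 0.0f : e.beta * row[n];
            }
        }
    }
}
```

A version that touched only `Data[0]` would leave the second and later outputs unscaled, which is the kind of error the description reports for BatchSize > 1.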
---------
Signed-off-by: Cathal Lawlor <cathal.lawlor@arm.com>