MLAS/POWER10: Optimize Sgemm PackA kernel using VSX intrinsics and assembly. (#27575)
### Description
Introduce an optimized POWER10 PackA implementation that uses VSX
builtins and assembly to pre-pack 8 rows of matrix A at a time, packing
64 bytes (16 single-precision floats) per row per iteration.
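
For illustration only, the access pattern described above can be sketched as a portable scalar reference. This is not the kernel added by this change: the function name `PackA8xK`, the row-major input with leading dimension `lda`, the packed layout (each row's 16 floats stored contiguously, row after row, per column block), and the requirement that `k` be a multiple of 16 are all assumptions made for the sketch.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical scalar reference; the real kernel uses VSX builtins and assembly.
constexpr std::size_t kRows = 8;          // rows of A packed together
constexpr std::size_t kColsPerStep = 16;  // 16 floats = 64 bytes per row per iteration

// Packs an 8 x k row-major panel of A (leading dimension lda) into `packed`.
// Assumption: k is a multiple of 16, and `packed` holds at least 8 * k floats.
void PackA8xK(const float* a, std::size_t lda, std::size_t k, float* packed) {
    for (std::size_t col = 0; col < k; col += kColsPerStep) {
        for (std::size_t row = 0; row < kRows; ++row) {
            // Copy one 64-byte chunk from the current row into the packed buffer.
            std::memcpy(packed, a + row * lda + col, kColsPerStep * sizeof(float));
            packed += kColsPerStep;
        }
    }
}
```

In a vectorized version, each 64-byte chunk would correspond to four 16-byte VSX vector loads and stores per row; the layout actually consumed by the POWER10 SGEMM kernel may differ from this sketch.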
### Motivation and Context
Prompt-processing speedups observed with granite-3.1-8b:
- 14% speedup (batch size 1)
- 6% speedup (batch size 4)
- 4% speedup (batch size 8)
---------
Signed-off-by: Mahesh Bodapati <bmahi496@linux.ibm.com>