onnxruntime
5274c195 - MLAS/POWER10: Optimize Sgemm PackA kernel using VSX intrinsics and assembly. (#27575)

Commit
4 days ago
### Description

Introduce an optimized POWER10 PackA implementation leveraging VSX builtins and assembly to pre-pack 8 rows of matrix A, packing 64 bytes per row per iteration.

### Motivation and Context

Performance improvements observed in prompt processing:

- 14% speedup (batch size 1)
- 6% speedup (batch size 4)
- 4% speedup (batch size 8)

Tested with granite-3.1-8b.

---------

Signed-off-by: Mahesh Bodapati <bmahi496@linux.ibm.com>
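To illustrate what pre-packing 8 rows of A at 64 bytes (16 floats) per row per iteration could look like, here is a portable scalar sketch. It is not the code from this commit: the function name `PackA8Rows`, the constants, the zero-padding of partial panels, and the packed layout (8 row values stored contiguously for each column) are all assumptions chosen for illustration, and the real kernel uses VSX builtins and assembly rather than scalar loops.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names for illustration only; not the MLAS API.
constexpr std::size_t kPanelRows = 8;     // rows packed together (from the commit description)
constexpr std::size_t kColsPerIter = 16;  // 64 bytes per row / sizeof(float)

// Packs a row-major M x K matrix A (leading dimension lda) into panels of 8 rows.
// Assumed layout: inside a panel, the 8 row values of column k are contiguous.
// Rows past M in the last panel are left as zeros.
std::vector<float> PackA8Rows(const float* A, std::size_t M, std::size_t K,
                              std::size_t lda) {
    assert(lda >= K);
    const std::size_t panels = (M + kPanelRows - 1) / kPanelRows;
    std::vector<float> packed(panels * kPanelRows * K, 0.0f);

    for (std::size_t p = 0; p < panels; ++p) {
        const std::size_t row0 = p * kPanelRows;
        float* panel = packed.data() + p * kPanelRows * K;

        // Each outer iteration consumes up to 16 floats (64 bytes) from every row.
        for (std::size_t k0 = 0; k0 < K; k0 += kColsPerIter) {
            const std::size_t kEnd = std::min(k0 + kColsPerIter, K);
            for (std::size_t r = 0; r < kPanelRows && row0 + r < M; ++r) {
                const float* src = A + (row0 + r) * lda;
                for (std::size_t k = k0; k < kEnd; ++k) {
                    panel[k * kPanelRows + r] = src[k];
                }
            }
        }
    }
    return packed;
}
```

A vectorized version would typically replace the two inner loops with 16-byte vector loads and an in-register transpose, which is presumably where the VSX builtins come in; the exact instruction sequence is not stated in the commit message.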