POWER : Implement MlasGemmQuantKernel using VSX builtins for M = 1 (#25490)
POWER : Added a VSX-based implementation of MlasGemmQuantKernel
optimized for the case when M = 1.
Verified correctness using ONNX Runtime's built-in tests and
onnxruntime_mlas_tests; no regressions observed.
Evaluated performance using a Granite 8-bit quantized model and observed
approximately 3-5% improvement in token generation speed.
### Description
When M = 1, the kernel performs the multiplication using the VSX vector
builtin vec_msum.
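
As a rough illustration of the idea (not the code from this change), the sketch below computes one output element for the M = 1 case: the dot product of the single row of A (unsigned 8-bit) with one column of B (signed 8-bit). The helper name `DotRowU8S8`, the contiguous column layout, and the requirement that K be a multiple of 16 are assumptions made for this example and do not reflect the kernel's actual packing or interface. The non-VSX branch is a scalar model of the same lane-wise accumulation so the sketch can also be compiled on other platforms.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__VSX__)
#include <altivec.h>
// altivec.h may define these as macros; undefine them for C++ compatibility.
#undef vector
#undef pixel
#undef bool
#endif

// Hypothetical helper (assumption, not part of MLAS): dot product of K
// unsigned 8-bit values from the single row of A with K signed 8-bit values
// from one column of B. Assumes K is a multiple of 16 and that the column
// is stored contiguously.
int32_t DotRowU8S8(const uint8_t* a, const int8_t* b, size_t K)
{
#if defined(__VSX__)
    __vector signed int acc = vec_splats(0);
    for (size_t k = 0; k < K; k += 16) {
        __vector unsigned char va = vec_xl(0, a + k);
        __vector signed char vb = vec_xl(0, b + k);
        // Each 32-bit lane accumulates four adjacent signed x unsigned
        // byte products on top of its previous value.
        acc = vec_msum(vb, va, acc);
    }
    // Horizontal reduction of the four lanes.
    return acc[0] + acc[1] + acc[2] + acc[3];
#else
    // Scalar model of the same computation: four 32-bit lanes, each summing
    // four adjacent byte products per 16-byte step.
    int32_t lanes[4] = {0, 0, 0, 0};
    for (size_t k = 0; k < K; k += 16) {
        for (size_t lane = 0; lane < 4; ++lane) {
            for (size_t j = 0; j < 4; ++j) {
                const size_t idx = k + 4 * lane + j;
                lanes[lane] += static_cast<int32_t>(b[idx]) *
                               static_cast<int32_t>(a[idx]);
            }
        }
    }
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
}
```

Calling such a helper once per column of B would produce the 1 x N output row; zero-point correction and any tail handling for K not divisible by 16 are omitted here.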
### Motivation and Context
To improve token generation performance for models with a batch size of 1.