Improvements to the INT8 GEMM portion of the code for Power (#20595)
These changes improve the INT8 GEMM portion of the code for Power.
There are 3 main code changes:
1) Changing a function to a template parameter so that operations that
add/sub zero are eliminated at compile time. Also, reuse a vector that
holds the mask instead of rebuilding it each time.
2) Process 16 columns at a time in MlasGemmQuantCopyPackB8x8;
this should reduce potential page faults by a factor of 4 and should
also be faster.
3) Unroll MlasQgemmStoreVectorMMA and vectorize other variables.
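
As a rough illustration of the idea in item 1, here is a minimal scalar sketch, not the actual MLAS code. The names `PackRow`, `FlipSign`, and `kMask` are hypothetical, and the real routines use Power vector intrinsics rather than a scalar loop. The sketch only shows how moving a runtime choice into a template parameter lets the compiler drop the no-op path, and how a mask can be built once and reused.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a packing routine (assumption, not MLAS API).
// FlipSign selects at compile time whether signed int8 input is shifted
// into the unsigned range by toggling the sign bit.
template <bool FlipSign>
void PackRow(const int8_t* src, uint8_t* dst, int n)
{
    // The mask is a constant built once, outside the loop, and reused.
    const uint8_t kMask = 0x80;
    for (int i = 0; i < n; ++i) {
        uint8_t v = static_cast<uint8_t>(src[i]);
        // FlipSign is a compile-time constant, so the compiler can remove
        // this branch entirely in the PackRow<false> instantiation.
        if (FlipSign) {
            v = static_cast<uint8_t>(v ^ kMask);
        }
        dst[i] = v;
    }
}

// Small self-check showing both instantiations on the same input.
inline void PackRowSelfCheck()
{
    const int8_t src[2] = {-128, 127};
    uint8_t dst[2];
    PackRow<true>(src, dst, 2);
    assert(dst[0] == 0 && dst[1] == 255);
    PackRow<false>(src, dst, 2);
    assert(dst[0] == 128 && dst[1] == 127);
}
```

The caller picks `PackRow<true>` or `PackRow<false>` once, so the inner loop carries no per-element runtime check for the case where nothing needs to be adjusted.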