s390x SIMD implementation (#25757)
### Description
This change adds SIMD-optimized implementations of MLAS functions for s390x.
The implementations are based on the equivalent functions for ppc64le.
#### Build System Integration (onnxruntime_mlas.cmake):
* Adds a new S390X flag to the CMake build system to detect the target
architecture.
* Includes new source files specific to s390x (SgemmKernel.cpp,
DgemmKernel.cpp, Quantize.cpp, qgemm_kernel_zvector.cpp, etc.).
* Sets the necessary compiler flags (-mvx, -mzvector, -march=z15) to
enable z/Vector extensions.
#### Platform Abstraction (mlasi.h, platform.cpp):
* Defines MLAS_TARGET_S390X and MLAS_ZVECTOR_INTRINSICS for conditional
compilation.
* Integrates the new s390x kernels into the MLAS_PLATFORM dispatch
table.
* platform.cpp now checks for z/Vector support at runtime using
getauxval(AT_HWCAP) and HWCAP_S390_VXE, allowing it to fall back to
scalar implementations if the hardware support is not present.
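
The runtime check and scalar fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in platform.cpp: `DotKernel`, `DotScalar`, `DotZVector`, `HasZVectorEnhancements` and `Platform` are hypothetical names, and the sketch assumes `HWCAP_S390_VXE` is made available by the system headers on s390x Linux. On any other host the check simply reports no support, so the scalar path is selected.

```cpp
#include <cassert>

#if defined(__linux__)
#include <sys/auxv.h>  // getauxval, AT_HWCAP
#endif

// Hypothetical kernel signature, for illustration only (not the MLAS one).
using DotKernel = float (*)(const float* a, const float* b, int n);

// Portable scalar fallback.
float DotScalar(const float* a, const float* b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Stand-in for a z/Vector kernel. A real kernel would use vector intrinsics;
// this one forwards to the scalar code so the sketch builds on any host.
float DotZVector(const float* a, const float* b, int n) {
    return DotScalar(a, b, n);
}

// Returns true only on s390x Linux when the kernel reports the VXE
// capability bit. Assumption: HWCAP_S390_VXE comes from the system headers.
bool HasZVectorEnhancements() {
#if defined(__s390x__) && defined(__linux__) && defined(HWCAP_S390_VXE)
    return (getauxval(AT_HWCAP) & HWCAP_S390_VXE) != 0;
#else
    return false;
#endif
}

// Hypothetical dispatch table: defaults to scalar, upgraded at construction
// time if the hardware capability is present.
struct Platform {
    DotKernel dot = DotScalar;

    Platform() {
        if (HasZVectorEnhancements()) {
            dot = DotZVector;
        }
    }
};
```

Callers would go through the table (`Platform p; p.dot(a, b, n);`) and never name a specific kernel, which is what lets the same binary run on machines without the vector facility.
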
#### New Kernel Implementations:
* qgemm_kernel_zvector.cpp: Implements quantized integer matrix
multiplication. This is the core of the performance improvement for
quantized models.
* SgemmKernelZVECTOR.cpp / DgemmKernelZVECTOR.h: Implement single- and
double-precision floating-point GEMM.
* QuantizeZVECTOR.cpp / Quantize.cpp: Implement quantization and
requantization kernels.
* FgemmKernelZVECTOR.h: A generic header providing templates and macros
shared by the single- and double-precision GEMM kernels, similar to the
ppc64le implementation.
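
To illustrate why a shared, templated header is convenient for the two floating-point GEMM kernels, here is a scalar reference sketch in which one template serves both precisions. It is not taken from FgemmKernelZVECTOR.h: `ReferenceGemm`, `ReferenceSgemm` and `ReferenceDgemm` are hypothetical names, the row-major layout and the `alpha`/`beta` convention are assumptions, and no vectorization is attempted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference routine: C = alpha * A * B + beta * C, row-major.
// A is M x K with leading dimension lda, B is K x N with leading dimension
// ldb, and C is M x N with leading dimension ldc.
template <typename T>
void ReferenceGemm(std::size_t M, std::size_t N, std::size_t K, T alpha,
                   const T* A, std::size_t lda,
                   const T* B, std::size_t ldb,
                   T beta, T* C, std::size_t ldc) {
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            T acc = T(0);
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[m * lda + k] * B[k * ldb + n];
            }
            // When beta is zero, C is treated as write-only so that
            // uninitialized or NaN contents cannot leak into the result.
            T prior = (beta == T(0)) ? T(0) : beta * C[m * ldc + n];
            C[m * ldc + n] = alpha * acc + prior;
        }
    }
}

// Thin precision-specific wrappers over the shared template.
inline void ReferenceSgemm(std::size_t M, std::size_t N, std::size_t K,
                           const float* A, const float* B, float* C) {
    ReferenceGemm<float>(M, N, K, 1.0f, A, K, B, N, 0.0f, C, N);
}

inline void ReferenceDgemm(std::size_t M, std::size_t N, std::size_t K,
                           const double* A, const double* B, double* C) {
    ReferenceGemm<double>(M, N, K, 1.0, A, K, B, N, 0.0, C, N);
}
```

A scalar routine like this is also handy as a test oracle: the output of a vectorized kernel can be compared against it on small matrices.
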
### Motivation and Context
This change improves the performance of ONNX Runtime on s390x by providing
z/Vector kernels for MLAS in place of the scalar code paths.