[MLAS] Add 8-bit weights ARM64 Gemm implementation (#25110)
### Description
Enables 8-bit-weight Gemm on ARM64 via MLAS.
1. Supports two flavors of the 8-bit Gemm kernel: one uses `vdotq` (U8U8), and the other uses `vusdotq` (U8S8) on platforms where I8MM is supported.
2. Provides access to these new MLAS Gemm kernels via the `MatMulNBits`
contrib operator.
3. Tests:
**MLAS**
Three new sets of tests:
- `SQ8BitQuantA`: Tests the dynamic activation quantization MLAS kernel
(`fp32 -> uint8_t`, or `fp32 -> int8_t` on I8MM platforms)
- `SQ8BitPrepack`: Tests the prepacking of the weights for the 8-bit
Gemm kernels
- `SQ8BitGemm`: Tests the 8-bit Gemm kernels
**MatMulNBits contrib op tests**
- Enables the 8-bit Gemm tests on ARM64 (previously only enabled on x86)
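The arithmetic these kernels and tests exercise can be sketched as a small NumPy reference model: dynamically quantize a row of fp32 activations to `uint8`, accumulate the matmul in `int32` (the step `vdotq` performs four 8-bit lanes at a time), then dequantize with the combined scales. This is an illustrative sketch only, not MLAS code; the function names, per-column weight scales, and the asymmetric activation scheme are assumptions for the example.

```python
import numpy as np

def quantize_activations(a):
    # Hypothetical model of dynamic per-row asymmetric quantization,
    # fp32 -> uint8 (the path SQ8BitQuantA tests covers).
    amin = min(float(a.min()), 0.0)
    amax = max(float(a.max()), 0.0)
    scale = (amax - amin) / 255.0 or 1.0
    zero_point = int(round(-amin / scale))
    q = np.clip(np.round(a / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def q8_gemm_row(a_row, b_q, b_scale, b_zp):
    # b_q: uint8 weights of shape [K, N]; b_scale: per-column fp32 scales;
    # b_zp: weight zero point. Roughly models the U8U8 kernel flavor.
    a_q, a_scale, a_zp = quantize_activations(a_row)
    # Widen to int32, subtract zero points, and accumulate exactly.
    acc = (a_q.astype(np.int32) - a_zp) @ (b_q.astype(np.int32) - b_zp)
    # Dequantize: each output column picks up both scales.
    return acc * (a_scale * b_scale)
```

The result should approximate the fp32 matmul `a_row @ B` up to quantization error, which is what the `SQ8BitGemm` tests check against a reference implementation.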
### Motivation and Context
Enables 8-bit-weight Gemm on ARM64 via MLAS.
Based on work contributed by @fajin-corp.
Phi-4-mini-instruct perf numbers (before and after this change):
<img width="593" height="179" alt="image"
src="https://github.com/user-attachments/assets/d81b9059-b8db-407c-8c0f-527099f9358c"
/>
---------
Co-authored-by: Jing Fang <fajin@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>