Add RISC-V Vector (RVV) support for CPU Execution Provider (#28261)
## Motivation and Context
Close #17466 and #24596
MLAS already provides architecture-specific optimized kernels for
multiple vector ISAs, such as SSE/AVX/AVX2/AVX512 on x86/x64, NEON/SVE
on Arm, VSX on POWER, LSX/LASX on LoongArch, and zvector on s390x.
However, riscv64 has not had comparable RVV-optimized coverage for the
operators in this PR and has mainly fallen back to scalar code.
This PR adds **RISC-V Vector (RVV)** extension support to the ONNX
Runtime CPU Execution Provider, covering two operators: SGEMM and Softmax.
Optimizations for several other operators are already complete.
Once this PR is accepted, I will work with @qiurui144 to upstream the
remaining optimized kernels in a series of follow-up PRs.
## Benchmark Results
### SGEMM
| Case | pack_b | RVV pack ms | RVV compute ms | Scalar pack ms | Scalar compute ms | Compute speedup | End-to-end speedup |
|---|---:|---:|---:|---:|---:|---:|---:|
| 128x3072x768 | 1 | 63.21 | 114.52 | 66.71 | 414.44 | 3.62x | 2.71x |
| 64x1024x1024 | 1 | 22.07 | 27.66 | 23.14 | 96.64 | 3.49x | 2.41x |
| 32x4096x1024 | 1 | 119.04 | 56.82 | 118.86 | 188.34 | 3.31x | 1.75x |
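
The PR does not state how the two speedup columns are derived. The reported values are consistent with compute speedup being the ratio of scalar to RVV compute time, and end-to-end speedup being the ratio of total (pack plus compute) times. The sketch below recomputes them under that assumption; the function name `speedups` is hypothetical and exists only for this illustration.

```python
def speedups(rvv_pack, rvv_compute, scalar_pack, scalar_compute):
    """Return (compute speedup, end-to-end speedup), rounded to 2 decimals.

    Assumed definitions, inferred from the table rather than stated in the PR:
      compute speedup    = scalar compute / RVV compute
      end-to-end speedup = (scalar pack + scalar compute) / (RVV pack + RVV compute)
    """
    compute = scalar_compute / rvv_compute
    end_to_end = (scalar_pack + scalar_compute) / (rvv_pack + rvv_compute)
    return round(compute, 2), round(end_to_end, 2)


# Timings in ms, copied from the SGEMM table:
# (RVV pack, RVV compute, scalar pack, scalar compute)
rows = {
    "128x3072x768": (63.21, 114.52, 66.71, 414.44),
    "64x1024x1024": (22.07, 27.66, 23.14, 96.64),
    "32x4096x1024": (119.04, 56.82, 118.86, 188.34),
}

for case, timings in rows.items():
    compute, end_to_end = speedups(*timings)
    print(f"{case}: compute {compute:.2f}x, end-to-end {end_to_end:.2f}x")
```

Packing time is nearly the same in the RVV and scalar runs, so the end-to-end gain shrinks as packing takes a larger share of the total, as in the 32x4096x1024 case.
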
### Softmax
| Case | Scalar ms | RVV ms | Speedup |
|---|---:|---:|---:|
| 4096x128 | 1955.25 | 611.65 | 3.20x |
| 1024x1024 | 717.26 | 236.73 | 3.03x |