8349522: AArch64: Add backend implementation for new unsigned and saturating vector operations
PR [1] added several new vector operations to the Vector API,
together with their X86 backend implementation. This patch adds
the AArch64 backend implementation for NEON and SVE.
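For readers unfamiliar with the new operations, here is a scalar
sketch of what each byte lane is expected to compute, going by the
operator names used in the benchmarks below (SADD, SSUB, SUADD,
SUSUB, UMAX, UMIN). The class and method names are hypothetical
and exist only for illustration. This is not the backend code from
this patch, and the semantics are stated as I understand them from
the operator names.

```java
// Hypothetical scalar reference for one byte lane; not code from this patch.
final class SaturatingLaneOps {
    // SADD: signed add, clamped to [Byte.MIN_VALUE, Byte.MAX_VALUE].
    public static byte saddByte(byte a, byte b) {
        int r = a + b;
        return (byte) Math.max(Byte.MIN_VALUE, Math.min(Byte.MAX_VALUE, r));
    }

    // SSUB: signed subtract, clamped to [Byte.MIN_VALUE, Byte.MAX_VALUE].
    public static byte ssubByte(byte a, byte b) {
        int r = a - b;
        return (byte) Math.max(Byte.MIN_VALUE, Math.min(Byte.MAX_VALUE, r));
    }

    // SUADD: operands read as 0..255, sum clamped to 255.
    public static byte suaddByte(byte a, byte b) {
        int r = Byte.toUnsignedInt(a) + Byte.toUnsignedInt(b);
        return (byte) Math.min(r, 0xFF);
    }

    // SUSUB: operands read as 0..255, difference clamped to 0.
    public static byte susubByte(byte a, byte b) {
        int r = Byte.toUnsignedInt(a) - Byte.toUnsignedInt(b);
        return (byte) Math.max(r, 0);
    }

    // UMAX: maximum under unsigned comparison.
    public static byte umaxByte(byte a, byte b) {
        return Byte.compareUnsigned(a, b) >= 0 ? a : b;
    }

    // UMIN: minimum under unsigned comparison.
    public static byte uminByte(byte a, byte b) {
        return Byte.compareUnsigned(a, b) <= 0 ? a : b;
    }

    public static void main(String[] args) {
        System.out.println(saddByte((byte) 100, (byte) 100));                       // 127
        System.out.println(Byte.toUnsignedInt(suaddByte((byte) 200, (byte) 100)));  // 255
        System.out.println(susubByte((byte) 10, (byte) 20));                        // 0
        System.out.println(Byte.toUnsignedInt(umaxByte((byte) -1, (byte) 1)));      // 255
    }
}
```

The other lane types (short, int, long) follow the same pattern
with their own value ranges.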
The related Vector API JMH microbenchmarks improve by about 70x
to 95x on an AArch64 SVE2 machine with a 128-bit vector length,
across the different UseSVE options. Here are the uplift
details:
```
Benchmark                   (size)  Mode   Cnt  -XX:UseSVE=0  -XX:UseSVE=1  -XX:UseSVE=2
ByteMaxVector.SADD            1024  thrpt   30        80.69x        79.70x       80.534x
ByteMaxVector.SADDMasked      1024  thrpt   30        84.08x        85.72x       85.901x
ByteMaxVector.SSUB            1024  thrpt   30        80.46x        80.27x       81.063x
ByteMaxVector.SSUBMasked      1024  thrpt   30        83.96x        85.26x       85.887x
ByteMaxVector.SUADD           1024  thrpt   30        80.43x        80.36x       81.761x
ByteMaxVector.SUADDMasked     1024  thrpt   30        83.40x        84.62x       85.199x
ByteMaxVector.SUSUB           1024  thrpt   30        79.93x        79.22x       79.714x
ByteMaxVector.SUSUBMasked     1024  thrpt   30        82.93x        85.02x       84.726x
ByteMaxVector.UMAX            1024  thrpt   30        78.73x        77.39x       78.220x
ByteMaxVector.UMAXMasked      1024  thrpt   30        82.62x        84.77x       85.531x
ByteMaxVector.UMIN            1024  thrpt   30        79.04x        77.80x       78.471x
ByteMaxVector.UMINMasked      1024  thrpt   30        83.11x        84.86x       86.126x
IntMaxVector.SADD             1024  thrpt   30        83.11x        83.07x       83.183x
IntMaxVector.SADDMasked       1024  thrpt   30        90.67x        91.80x       93.162x
IntMaxVector.SSUB             1024  thrpt   30        83.37x        82.82x       83.317x
IntMaxVector.SSUBMasked       1024  thrpt   30        90.85x        92.87x       94.201x
IntMaxVector.SUADD            1024  thrpt   30        82.76x        81.78x       82.679x
IntMaxVector.SUADDMasked      1024  thrpt   30        90.49x        91.93x       93.155x
IntMaxVector.SUSUB            1024  thrpt   30        82.92x        82.34x       82.525x
IntMaxVector.SUSUBMasked      1024  thrpt   30        90.60x        92.12x       92.951x
IntMaxVector.UMAX             1024  thrpt   30        82.40x        81.85x       82.242x
IntMaxVector.UMAXMasked       1024  thrpt   30        90.30x        92.10x       92.587x
IntMaxVector.UMIN             1024  thrpt   30        82.84x        81.43x       82.801x
IntMaxVector.UMINMasked       1024  thrpt   30        90.43x        91.49x       92.678x
LongMaxVector.SADD            1024  thrpt   30        82.01x        81.74x       82.153x
LongMaxVector.SADDMasked      1024  thrpt   30        91.61x        92.69x       93.579x
LongMaxVector.SSUB            1024  thrpt   30        81.97x        81.42x       82.991x
LongMaxVector.SSUBMasked      1024  thrpt   30        91.34x        92.47x       93.026x
LongMaxVector.SUADD           1024  thrpt   30        82.44x        81.29x       82.506x
LongMaxVector.SUADDMasked     1024  thrpt   30        92.21x        92.35x       93.419x
LongMaxVector.SUSUB           1024  thrpt   30        82.04x        80.98x       81.761x
LongMaxVector.SUSUBMasked     1024  thrpt   30        91.74x        92.39x       93.375x
LongMaxVector.UMAX            1024  thrpt   30        81.59x        80.21x       82.162x
LongMaxVector.UMAXMasked      1024  thrpt   30        70.09x        92.89x       93.627x
LongMaxVector.UMIN            1024  thrpt   30        82.31x        81.95x       82.298x
LongMaxVector.UMINMasked      1024  thrpt   30        69.85x        92.19x       93.390x
ShortMaxVector.SADD           1024  thrpt   30        80.08x        79.15x       80.310x
ShortMaxVector.SADDMasked     1024  thrpt   30        90.74x        92.00x       93.743x
ShortMaxVector.SSUB           1024  thrpt   30        79.54x        78.67x       80.584x
ShortMaxVector.SSUBMasked     1024  thrpt   30        91.18x        92.10x       93.725x
ShortMaxVector.SUADD          1024  thrpt   30        79.86x        79.37x       80.372x
ShortMaxVector.SUADDMasked    1024  thrpt   30        90.17x        92.43x       93.759x
ShortMaxVector.SUSUB          1024  thrpt   30        79.78x        79.85x       80.744x
ShortMaxVector.SUSUBMasked    1024  thrpt   30        89.99x        91.91x       93.320x
ShortMaxVector.UMAX           1024  thrpt   30        79.87x        79.81x       80.518x
ShortMaxVector.UMAXMasked     1024  thrpt   30        89.69x        91.70x       92.826x
ShortMaxVector.UMIN           1024  thrpt   30        79.11x        77.98x       79.458x
ShortMaxVector.UMINMasked     1024  thrpt   30        90.49x        92.86x       93.323x
```
Tested with `hotspot::hotspot_all` and `jdk::jdk_all`; no new
regressions were found.
[1] https://github.com/openjdk/jdk/pull/20507