8350463: AArch64: Add vector rearrange support for small lane count vectors
The AArch64 vector rearrange implementation currently lacks support for
vector types with fewer than 4 lanes (see [1]). This limitation causes
significant performance gaps for Long/Double vector benchmarks on NVIDIA
Grace (SVE2 architecture with 128-bit vectors) compared to other SVE and
x86 platforms.
Vector rearrange operations depend on vector shuffle inputs, which
previously used a byte array as payload. Because the minimum vector lane
count for the byte type on AArch64 is 4, rearrange operations inherited
the same limit. However, the vector shuffle payload has since been
updated to use vector-specific data types (e.g., `int` for `IntVector`)
(see [2]). This change lets us remove the lane count restriction for
vector rearrange operations.
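
For readers unfamiliar with the operation, here is a minimal scalar sketch of what a rearrange computes for the 2-lane case: each result lane takes the source lane selected by the corresponding shuffle index. The class `RearrangeSketch` and its `rearrange` helper are hypothetical names used only for illustration. They are not part of the JDK or of this patch, and the sketch assumes all shuffle indices are in range.

```java
public class RearrangeSketch {
    // Scalar model of a lane rearrange: result lane i takes the source
    // lane selected by shuffle[i]. Assumes every index is in [0, src.length).
    static long[] rearrange(long[] src, int[] shuffle) {
        long[] dst = new long[src.length];
        for (int i = 0; i < dst.length; i++) {
            dst[i] = src[shuffle[i]];
        }
        return dst;
    }

    public static void main(String[] args) {
        // Two 64-bit lanes, as in a 128-bit long vector (the 2-lane case).
        long[] src = {10L, 20L};
        int[] swap = {1, 0};
        long[] out = rearrange(src, swap);
        System.out.println(out[0] + " " + out[1]); // prints "20 10"
    }
}
```

The indices here are plain `int` values for illustration only; they say nothing about how the shuffle payload is laid out in the actual implementation.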
This patch adds rearrange support for vector types with small lane
counts. The main changes are:
- Add AArch64 match rule support for `VectorRearrange` with smaller
lane counts (e.g., `2D/2S`)
- Move the NEON implementation from the ad file to the C2 macro
assembler file, which is better suited to complex implementations
- Reduce temporary register usage in the NEON implementation for
short/int/float types from two registers to one
The following tables show the performance improvement for several Vector
API JMH benchmarks on an NVIDIA Grace CPU, with NEON and with SVE.
Performance of the same benchmarks with other vector types remains
unchanged.
1) NEON
JMH on panama-vector:vectorIntrinsics:
```
Benchmark                     (size)  Mode   Cnt  Units   Before     After    Gain
Double128Vector.rearrange       1024  thrpt   30  ops/ms  78.060   578.859   7.42x
Double128Vector.sliceUnary      1024  thrpt   30  ops/ms  72.332  1811.664  25.05x
Double128Vector.unsliceUnary    1024  thrpt   30  ops/ms  72.256  1812.344  25.08x
Float64Vector.rearrange         1024  thrpt   30  ops/ms  77.879   558.797   7.18x
Float64Vector.sliceUnary        1024  thrpt   30  ops/ms  70.528  1981.304  28.09x
Float64Vector.unsliceUnary      1024  thrpt   30  ops/ms  71.735  1994.168  27.79x
Int64Vector.rearrange           1024  thrpt   30  ops/ms  76.374   562.106   7.36x
Int64Vector.sliceUnary          1024  thrpt   30  ops/ms  71.680  1190.127  16.60x
Int64Vector.unsliceUnary        1024  thrpt   30  ops/ms  71.895  1185.094  16.48x
Long128Vector.rearrange         1024  thrpt   30  ops/ms  78.902   579.250   7.34x
Long128Vector.sliceUnary        1024  thrpt   30  ops/ms  72.389   747.794  10.33x
Long128Vector.unsliceUnary      1024  thrpt   30  ops/ms  71.999   747.848  10.38x
```
JMH on jdk mainline:
```
Benchmark                                      (SIZE)  Mode   Cnt  Units   Before     After    Gain
SelectFromBenchmark.rearrangeFromDoubleVector    1024  thrpt   30  ops/ms  44.593  1319.977  29.63x
SelectFromBenchmark.rearrangeFromDoubleVector    2048  thrpt   30  ops/ms  22.318   660.061  29.58x
SelectFromBenchmark.rearrangeFromLongVector      1024  thrpt   30  ops/ms  45.823  1458.144  31.82x
SelectFromBenchmark.rearrangeFromLongVector      2048  thrpt   30  ops/ms  23.050   729.881  31.67x
VectorXXH3HashingBenchmark.hashingKernel         1024  thrpt   30  ops/ms  97.210  1082.884  11.14x
VectorXXH3HashingBenchmark.hashingKernel         2048  thrpt   30  ops/ms  48.642   541.341  11.13x
VectorXXH3HashingBenchmark.hashingKernel         4096  thrpt   30  ops/ms  24.285   270.419  11.14x
VectorXXH3HashingBenchmark.hashingKernel         8192  thrpt   30  ops/ms  12.421   135.115  10.88x
```
2) SVE
JMH on panama-vector:vectorIntrinsics:
```
Benchmark                     (size)  Mode   Cnt  Units   Before     After    Gain
Double128Vector.rearrange       1024  thrpt   30  ops/ms  78.396   577.744   7.37x
Double128Vector.sliceUnary      1024  thrpt   30  ops/ms  72.119  2538.261  35.19x
Double128Vector.unsliceUnary    1024  thrpt   30  ops/ms  72.992  2536.972  34.75x
Float64Vector.rearrange         1024  thrpt   30  ops/ms  77.400   561.934   7.26x
Float64Vector.sliceUnary        1024  thrpt   30  ops/ms  70.858  2949.076  41.61x
Float64Vector.unsliceUnary      1024  thrpt   30  ops/ms  70.654  2954.273  41.81x
Int64Vector.rearrange           1024  thrpt   30  ops/ms  77.851   563.969   7.24x
Int64Vector.sliceUnary          1024  thrpt   30  ops/ms  67.433  1510.484  22.39x
Int64Vector.unsliceUnary        1024  thrpt   30  ops/ms  66.614  1511.617  22.69x
Long128Vector.rearrange         1024  thrpt   30  ops/ms  77.637   579.021   7.46x
Long128Vector.sliceUnary        1024  thrpt   30  ops/ms  69.886  1274.331  18.23x
Long128Vector.unsliceUnary      1024  thrpt   30  ops/ms  70.069  1273.787  18.17x
```
JMH on jdk mainline:
```
Benchmark                                      (SIZE)  Mode   Cnt  Units   Before     After    Gain
SelectFromBenchmark.rearrangeFromDoubleVector    1024  thrpt   30  ops/ms  44.612  1351.850  30.30x
SelectFromBenchmark.rearrangeFromDoubleVector    2048  thrpt   30  ops/ms  22.315   676.314  30.31x
SelectFromBenchmark.rearrangeFromLongVector      1024  thrpt   30  ops/ms  46.372  1502.036  32.39x
SelectFromBenchmark.rearrangeFromLongVector      2048  thrpt   30  ops/ms  23.361   749.133  32.07x
VectorXXH3HashingBenchmark.hashingKernel         1024  thrpt   30  ops/ms  97.780  1759.061  17.99x
VectorXXH3HashingBenchmark.hashingKernel         2048  thrpt   30  ops/ms  48.923   879.584  17.98x
VectorXXH3HashingBenchmark.hashingKernel         4096  thrpt   30  ops/ms  24.219   439.588  18.15x
VectorXXH3HashingBenchmark.hashingKernel         8192  thrpt   30  ops/ms  12.416   219.603  17.69x
```
[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_vector.ad#L209
[2] https://bugs.openjdk.org/browse/JDK-8310691