jdk
8934fae6 - 8350463: AArch64: Add vector rearrange support for small lane count vectors

Commit
10 hours ago
8350463: AArch64: Add vector rearrange support for small lane count vectors

The AArch64 vector rearrange implementation currently lacks support for vector types with lane counts < 4 (see [1]). This limitation results in significant performance gaps when running Long/Double vector benchmarks on NVIDIA Grace (SVE2 architecture with 128-bit vectors) compared to other SVE and x86 platforms.

Vector rearrange operations depend on vector shuffle inputs, which previously used a byte array as payload. The minimum vector lane count of 4 for the byte type on AArch64 imposed this limitation on rearrange operations. However, the vector shuffle payload has been updated to use vector-specific data types (e.g., `int` for `IntVector`) (see [2]). This change enables us to remove the lane count restriction for vector rearrange operations.

This patch adds rearrange support for vector types with small lane counts. Here are the main changes:
- Added AArch64 match rule support for `VectorRearrange` with smaller lane counts (e.g., `2D/2S`)
- Relocated the NEON implementation from the ad file to the c2 macro assembler file for better handling of the complex implementation
- Optimized temporary register usage in the NEON implementation for short/int/float types from two registers to one

The following is the performance improvement data for several Vector API JMH benchmarks on an NVIDIA Grace CPU with NEON and SVE. Performance of the same JMH benchmarks with other vector types remains unchanged.

1) NEON

JMH on panama-vector:vectorIntrinsics:
```
Benchmark                     (size)  Mode   Cnt  Units   Before  After     Gain
Double128Vector.rearrange     1024    thrpt  30   ops/ms  78.060  578.859   7.42x
Double128Vector.sliceUnary    1024    thrpt  30   ops/ms  72.332  1811.664  25.05x
Double128Vector.unsliceUnary  1024    thrpt  30   ops/ms  72.256  1812.344  25.08x
Float64Vector.rearrange       1024    thrpt  30   ops/ms  77.879  558.797   7.18x
Float64Vector.sliceUnary      1024    thrpt  30   ops/ms  70.528  1981.304  28.09x
Float64Vector.unsliceUnary    1024    thrpt  30   ops/ms  71.735  1994.168  27.79x
Int64Vector.rearrange         1024    thrpt  30   ops/ms  76.374  562.106   7.36x
Int64Vector.sliceUnary        1024    thrpt  30   ops/ms  71.680  1190.127  16.60x
Int64Vector.unsliceUnary      1024    thrpt  30   ops/ms  71.895  1185.094  16.48x
Long128Vector.rearrange       1024    thrpt  30   ops/ms  78.902  579.250   7.34x
Long128Vector.sliceUnary      1024    thrpt  30   ops/ms  72.389  747.794   10.33x
Long128Vector.unsliceUnary    1024    thrpt  30   ops/ms  71.999  747.848   10.38x
```

JMH on jdk mainline:
```
Benchmark                                      (SIZE)  Mode   Cnt  Units   Before  After     Gain
SelectFromBenchmark.rearrangeFromDoubleVector  1024    thrpt  30   ops/ms  44.593  1319.977  29.63x
SelectFromBenchmark.rearrangeFromDoubleVector  2048    thrpt  30   ops/ms  22.318  660.061   29.58x
SelectFromBenchmark.rearrangeFromLongVector    1024    thrpt  30   ops/ms  45.823  1458.144  31.82x
SelectFromBenchmark.rearrangeFromLongVector    2048    thrpt  30   ops/ms  23.050  729.881   31.67x
VectorXXH3HashingBenchmark.hashingKernel       1024    thrpt  30   ops/ms  97.210  1082.884  11.14x
VectorXXH3HashingBenchmark.hashingKernel       2048    thrpt  30   ops/ms  48.642  541.341   11.13x
VectorXXH3HashingBenchmark.hashingKernel       4096    thrpt  30   ops/ms  24.285  270.419   11.14x
VectorXXH3HashingBenchmark.hashingKernel       8192    thrpt  30   ops/ms  12.421  135.115   10.88x
```

2) SVE

JMH on panama-vector:vectorIntrinsics:
```
Benchmark                     (size)  Mode   Cnt  Units   Before  After     Gain
Double128Vector.rearrange     1024    thrpt  30   ops/ms  78.396  577.744   7.37x
Double128Vector.sliceUnary    1024    thrpt  30   ops/ms  72.119  2538.261  35.19x
Double128Vector.unsliceUnary  1024    thrpt  30   ops/ms  72.992  2536.972  34.75x
Float64Vector.rearrange       1024    thrpt  30   ops/ms  77.400  561.934   7.26x
Float64Vector.sliceUnary      1024    thrpt  30   ops/ms  70.858  2949.076  41.61x
Float64Vector.unsliceUnary    1024    thrpt  30   ops/ms  70.654  2954.273  41.81x
Int64Vector.rearrange         1024    thrpt  30   ops/ms  77.851  563.969   7.24x
Int64Vector.sliceUnary        1024    thrpt  30   ops/ms  67.433  1510.484  22.39x
Int64Vector.unsliceUnary      1024    thrpt  30   ops/ms  66.614  1511.617  22.69x
Long128Vector.rearrange       1024    thrpt  30   ops/ms  77.637  579.021   7.46x
Long128Vector.sliceUnary      1024    thrpt  30   ops/ms  69.886  1274.331  18.23x
Long128Vector.unsliceUnary    1024    thrpt  30   ops/ms  70.069  1273.787  18.17x
```

JMH on jdk mainline:
```
Benchmark                                      (SIZE)  Mode   Cnt  Units   Before  After     Gain
SelectFromBenchmark.rearrangeFromDoubleVector  1024    thrpt  30   ops/ms  44.612  1351.850  30.30x
SelectFromBenchmark.rearrangeFromDoubleVector  2048    thrpt  30   ops/ms  22.315  676.314   30.31x
SelectFromBenchmark.rearrangeFromLongVector    1024    thrpt  30   ops/ms  46.372  1502.036  32.39x
SelectFromBenchmark.rearrangeFromLongVector    2048    thrpt  30   ops/ms  23.361  749.133   32.07x
VectorXXH3HashingBenchmark.hashingKernel       1024    thrpt  30   ops/ms  97.780  1759.061  17.99x
VectorXXH3HashingBenchmark.hashingKernel       2048    thrpt  30   ops/ms  48.923  879.584   17.98x
VectorXXH3HashingBenchmark.hashingKernel       4096    thrpt  30   ops/ms  24.219  439.588   18.15x
VectorXXH3HashingBenchmark.hashingKernel       8192    thrpt  30   ops/ms  12.416  219.603   17.69x
```

[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_vector.ad#L209
[2] https://bugs.openjdk.org/browse/JDK-8310691
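
For readers unfamiliar with the operation, the following is a minimal scalar sketch of what a rearrange computes for a two-lane vector (the `2D` case named above): lane `i` of the result takes the source lane selected by the shuffle index at `i`. This only models the semantics and is not the HotSpot NEON/SVE code from this patch. The class name `RearrangeModel` and the method `rearrange` are hypothetical, and rejecting out-of-range indexes with an exception is a simplifying assumption.

```java
import java.util.Arrays;

/** Scalar model of a lane rearrange. Hypothetical names; not HotSpot or Vector API code. */
public class RearrangeModel {

    /**
     * Returns a new array whose lane i holds src[shuffle[i]].
     * Out-of-range indexes are rejected here for simplicity (an assumption of this sketch).
     */
    public static long[] rearrange(long[] src, int[] shuffle) {
        if (src.length != shuffle.length) {
            throw new IllegalArgumentException("lane count mismatch");
        }
        long[] result = new long[src.length];
        for (int i = 0; i < shuffle.length; i++) {
            int from = shuffle[i];
            if (from < 0 || from >= src.length) {
                throw new IndexOutOfBoundsException("bad lane index: " + from);
            }
            result[i] = src[from];
        }
        return result;
    }

    public static void main(String[] args) {
        long[] v = {10L, 20L};      // two 64-bit lanes, like a 128-bit long vector
        int[] swap = {1, 0};        // lane 0 takes lane 1, lane 1 takes lane 0
        int[] broadcast = {0, 0};   // both lanes take lane 0
        System.out.println(Arrays.toString(rearrange(v, swap)));
        System.out.println(Arrays.toString(rearrange(v, broadcast)));
    }
}
```

The shuffle indexes are kept as an `int[]` here purely for illustration; the point of the model is that a two-lane shuffle is already meaningful, which is the case the patch enables on AArch64.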