jdk
0a62dc33 - 8343689: AArch64: Optimize MulReduction implementation

Commit

124 days ago

8343689: AArch64: Optimize MulReduction implementation

Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used.

Benchmarks results for an AArch64 CPU with support for SVE with 256-bit vector length:

  Benchmark                 (size)   Mode      Old        New   Units
  Byte256Vector.MULLanes      1024  thrpt  502.498  10222.717  ops/ms
  Double256Vector.MULLanes    1024  thrpt  172.116   3130.997  ops/ms
  Float256Vector.MULLanes     1024  thrpt  291.612   4164.138  ops/ms
  Int256Vector.MULLanes       1024  thrpt  362.276   3717.213  ops/ms
  Long256Vector.MULLanes      1024  thrpt  184.826   2054.345  ops/ms
  Short256Vector.MULLanes     1024  thrpt  379.231   5716.223  ops/ms

Benchmarks results for an AArch64 CPU with support for SVE with 512-bit vector length:

  Benchmark                 (size)   Mode      Old        New   Units
  Byte512Vector.MULLanes      1024  thrpt  160.129   2630.600  ops/ms
  Double512Vector.MULLanes    1024  thrpt   51.229   1033.284  ops/ms
  Float512Vector.MULLanes     1024  thrpt   84.617   1658.400  ops/ms
  Int512Vector.MULLanes       1024  thrpt  109.419   1180.310  ops/ms
  Long512Vector.MULLanes      1024  thrpt   69.036    704.144  ops/ms
  Short512Vector.MULLanes     1024  thrpt  131.029   1629.632  ops/ms
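
The halving strategy described in the commit message can be modelled in scalar Java. This is only an illustrative sketch, not the HotSpot implementation: the class name, method names, and the `finalLanes` parameter are hypothetical, the array stands in for a vector register, and the lane count is assumed to be a power of two. It uses `int` lanes, where multiplication wraps modulo 2^32 and is therefore order-independent, so the halved and sequential products agree exactly.

```java
public final class MulReductionSketch {

    // Hypothetical reference: plain left-to-right product of all lanes.
    static int reduceMulSequential(int[] lanes) {
        int acc = 1;
        for (int v : lanes) {
            acc *= v;
        }
        return acc;
    }

    // Hypothetical model of the scheme in the commit message: multiply the
    // upper half of the active lanes into the lower half until only
    // `finalLanes` lanes remain (4 int lanes = 128 bits), then finish with a
    // simple loop that stands in for the existing ASIMD path.
    // Assumes lanes.length is a power of two.
    static int reduceMulByHalving(int[] lanes, int finalLanes) {
        int[] work = lanes.clone();          // do not modify the caller's data
        int active = work.length;
        while (active > finalLanes) {
            int half = active / 2;
            for (int i = 0; i < half; i++) {
                work[i] *= work[i + half];   // lower half *= upper half
            }
            active = half;
        }
        int acc = 1;
        for (int i = 0; i < active; i++) {
            acc *= work[i];
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] v256 = {1, 2, 3, 4, 5, 6, 7, 8};   // 8 int lanes model 256 bits
        System.out.println(reduceMulByHalving(v256, 4));   // prints 40320

        int[] v512 = new int[16];                // 16 int lanes model 512 bits
        for (int i = 0; i < v512.length; i++) {
            v512[i] = i + 1;
        }
        // The product overflows int, but both orders wrap to the same value.
        System.out.println(reduceMulByHalving(v512, 4) == reduceMulSequential(v512));
    }
}
```

In this model a 256-bit vector needs one halving step and a 512-bit vector needs two before the 128-bit tail is reached. For floating-point lanes the halved order can round differently from a strictly sequential product, which this integer sketch does not model.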

Author

mikabl-arm

Committer

mikabl-arm

Parents

b0e2be6f

Files (7)

src/hotspot/cpu/aarch64
- aarch64_vector.ad
- aarch64_vector_ad.m4
- assembler_aarch64.hpp
- c2_MacroAssembler_aarch64.cpp
- c2_MacroAssembler_aarch64.hpp
test/hotspot/gtest/aarch64
- aarch64-asmtest.py
- asmtest.out.h
