8346964: C2: Improve integer multiplication with constant in MulINode::Ideal()
Constant multiplication x*C can be optimized into LEFT SHIFT, ADD or SUB
instructions, since these instructions generally have lower latency and
higher throughput than MUL on most architectures. For example (both
identities are checked in the snippet after this list):
1. x*8 can be optimized into x<<3.
2. x*9 can be optimized into x + (x<<3), which can be lowered to one
SHIFT-ADD instruction (an ADD with a shifted operand) on some
architectures, such as aarch64 and x86_64.
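These identities can be checked directly in Java (a quick illustrative
snippet, not part of the patch; they hold under Java's wrapping int
arithmetic):
```java
public class MulIdentities {
    public static void main(String[] args) {
        int x = 12345;
        // Multiplication by a power of two is a single shift.
        assert x * 8 == x << 3;
        // x*9 = x + x*8; note the parentheses, '<<' binds more weakly than '+'.
        assert x * 9 == x + (x << 3);
        System.out.println("identities hold (run with -ea to enable asserts)");
    }
}
```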
Currently, OpenJDK implements a few such patterns in the mid-end, including:
1. |C| = 1<<n (n>0)
2. |C| = (1<<n) - 1 (n>0)
3. |C| = (1<<m) + (1<<n) (m>n, n>=0)
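These shapes can be recognized with simple bit tests; here is a minimal
Java sketch of the idea (my own helper, not C2's actual code in
MulINode::Ideal()):
```java
// Classify |C| against the three mid-end patterns (illustrative only).
static String classify(int absC) {
    if (absC <= 1) return "no pattern";          // |C| <= 1 is handled trivially elsewhere
    if (Integer.bitCount(absC) == 1)
        return "pattern 1: 1<<n";                // e.g. 8 = 1<<3
    if (Integer.bitCount(absC + 1) == 1)
        return "pattern 2: (1<<n) - 1";          // e.g. 7 = (1<<3) - 1
    if (Integer.bitCount(absC) == 2)
        return "pattern 3: (1<<m) + (1<<n)";     // e.g. 9, 18, 96
    return "no pattern";
}
```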
The first two patterns are fine, because on most architectures they are
lowered to a single ADD/SUB/SHIFT instruction. But the third pattern does
not always perform well on some architectures, such as aarch64. It can be
split into the following sub-patterns:
3.1. C = (1<<n) + 1 (n>0)
3.2. C = -((1<<n) + 1) (n>0)
3.3. C = (1<<m) + (1<<n) (m>n, n>0)
3.4. C = -((1<<m) + (1<<n)) (m>n, n>0)
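In scalar Java terms, these sub-patterns correspond to the following
rewrites, which map directly onto the aarch64 sequences shown below
(illustrative helpers with my own naming):
```java
// x*C per sub-pattern; n and m (m > n) are the set-bit positions of |C|.
// All identities hold under Java's wrapping int arithmetic.
static int mul31(int x, int n)        { return x + (x << n); }            // C = (1<<n) + 1
static int mul32(int x, int n)        { return -(x + (x << n)); }         // C = -((1<<n) + 1)
static int mul33(int x, int m, int n) { return (x << m) + (x << n); }     // C = (1<<m) + (1<<n)
static int mul34(int x, int m, int n) { return -((x << m) + (x << n)); }  // C = -((1<<m) + (1<<n))
```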
According to the Arm optimization guide, if the shift amount is greater
than 4, a shifted-operand ADD has the same latency and throughput as MUL,
so in that case converting MUL to ADD is not profitable. (For example,
C = 96 = (1<<6) + (1<<5) has both shift amounts greater than 4.)
Take a[i] * C on aarch64 as an example.
Before (MUL is not converted):
```
mov x1, #C
mul x2, x1, x0
```
Now (MUL is converted):
For 3.1:
```
add x2, x0, x0, lsl #n
```
For 3.2:
```
add x2, x0, x0, lsl #n // same cost as mul if n > 4
neg x2, x2
```
For 3.3:
```
lsl x1, x0, #m
add x2, x1, x0, lsl #n // same cost as mul if n > 4
```
For 3.4:
```
lsl x1, x0, #m
add x2, x1, x0, lsl #n // same cost as mul if n > 4
neg x2, x2
```
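For context, the benchmarks quoted below are JMH microbenchmarks; their
exact shape is not shown here, so treat the following as a hypothetical
reconstruction (array size and harness details are my assumptions):
```java
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class MulConstantBench {
    int[] a = new int[1024];   // hypothetical size

    // C = 9 exercises sub-pattern 3.1; testIntN33 would use C = -33 (3.2, n > 4), etc.
    @Benchmark
    public void testInt9(Blackhole bh) {
        for (int i = 0; i < a.length; i++) {
            bh.consume(a[i] * 9);
        }
    }
}
```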
Test results (ns/op) on Arm Neoverse V2:
```
Benchmark   Before   Now      Uplift       Pattern  Notes
testInt9    103.379  60.702   1.70305756   3.1
testIntN33  103.231  106.825  0.96635619   3.2      n > 4
testIntN9   103.448  103.005  1.004300762  3.2      n <= 4
testInt18   103.354  99.271   1.041129837  3.3      m <= 4, n <= 4
testInt36   103.396  99.186   1.042445506  3.3      m > 4, n <= 4
testInt96   103.337  105.416  0.980278136  3.3      m > 4, n > 4
testIntN18  103.333  139.258  0.742025593  3.4      m <= 4, n <= 4
testIntN36  103.208  139.132  0.741799155  3.4      m > 4, n <= 4
testIntN96  103.367  139.471  0.74113615   3.4      m > 4, n > 4
```
**(S1) Based on these results, we should treat pattern 3 as follows:**
3.1 C = (1<<n) + 1 (n>0)
3.2 C = -((1<<n) + 1) (0<n<=4)
3.3 C = (1<<m) + (1<<n) (m>n, 0<n<=4)
3.4 C = -((1<<m) + (1<<n)) (disable)
Since this conversion is implemented in the mid-end, it also impacts other
optimizations, such as auto-vectorization. Assume we have the following
loop, which can be vectorized.
Vector-A:
```
for (int i=0; i<len; i++) {
sum += a[i] * C;
}
```
Before:
```
movi v19.4s, #C // this will be hoisted out of the loop
mla v16.4s, v17.4s, v19.4s
```
After:
For 3.1:
```
shl v19.4s, v17.4s, #n
add v17.4s, v19.4s, v17.4s
add v16.4s, v16.4s, v17.4s
```
For 3.2:
```
(add w11, w11, w11, lsl #n
sub w11, w12, w11) * 4 // the pair is repeated 4 times, once per int (not vectorized)
```
For 3.3:
```
shl v18.4s, v17.4s, #m
shl v19.4s, v17.4s, #n
add v18.4s, v19.4s, v18.4s
add v16.4s, v16.4s, v18.4s
```
For 3.4:
```
(lsl w12, w4, #m
add w11, w12, w4, lsl #n
sub w13, w13, w11) * 4 // the triple is repeated 4 times, once per int (not vectorized)
```
The code generated before the conversion is simpler and faster:
```
Benchmark         Before  Now      Uplift       Pattern
testInt9AddSum    47.958  63.696   0.752920121  3.1
testIntN33AddSum  48.013  147.834  0.324776438  3.2
testIntN9AddSum   48.026  149.149  0.322000148  3.2
testInt18AddSum   47.971  69.393   0.691294511  3.3
testInt36AddSum   47.98   69.395   0.69140428   3.3
testInt96AddSum   47.992  69.453   0.690999669  3.3
testIntN18AddSum  48.014  157.132  0.305564748  3.4
testIntN36AddSum  48.02   157.094  0.305676856  3.4
testIntN96AddSum  48.032  153.642  0.312622851  3.4
```
**(S2) Based on these results, we should disable pattern 3 entirely.**
But other cases behave differently. For example:
Vector-B:
```
for (int i=0; i<100000; i++) {
a[i] = a[i] * C;
}
```
Test results:
```
Benchmark        Before  Now     Uplift       Pattern
testInt9Store    43.392  33.338  1.301577779  3.1
testIntN33Store  43.365  75.993  0.570644665  3.2
testIntN9Store   43.5    75.452  0.576525473  3.2
testInt18Store   43.442  41.847  1.038115038  3.3
testInt36Store   43.369  41.843  1.03646966   3.3
testInt96Store   43.389  41.931  1.03477141   3.3
testIntN18Store  43.372  57.909  0.748968209  3.4
testIntN36Store  43.373  57.042  0.760369552  3.4
testIntN96Store  43.405  58.145  0.746495829  3.4
```
**(S3) Based on these results, we should treat pattern 3 as follows:**
3.1 C = (1<<n) + 1 (n>0)
3.2 C = -((1<<n) + 1) (disable)
3.3 C = (1<<m) + (1<<n) (m>n, n>0)
3.4 C = -((1<<m) + (1<<n)) (disable)
Combining S1, S2 and S3 (each cell gives the recommended condition and the
measured uplift when the conversion is enabled), we get:
```
Pattern  S1                   S2               S3
3.1      (n>0, 1.7)           (disable, 0.75)  (n>0, 1.3)
3.2      (0<n<=4, 1.0)        (disable, 0.32)  (disable, 0.57)
3.3      (m>n, 0<n<=4, 1.04)  (disable, 0.69)  (m>n, n>0, 1.03)
3.4      (disable, 0.74)      (disable, 0.30)  (disable, 0.74)
```
For 3.1, it is similar to pattern 2 and is usually lowered to a single
instruction, so we tend to keep it in the mid-end.
For 3.2, we tend to disable it in the mid-end and redo the S1 conversion
in the back-end where it is profitable.
For 3.3, although S3 shows a 3% performance gain, S2 shows a 31%
regression, so we tend to disable it in the mid-end and redo the S1
conversion in the back-end.
For 3.4, we should not do this optimization anywhere.
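Putting these decisions together, the net mid-end behavior after this
patch can be summarized by a predicate like the following (an illustrative
Java sketch of the intent, not the actual C2 change in MulINode::Ideal()):
```java
// Illustrative predicate: which constants keep the mid-end mul-to-shift
// conversion after this patch. Helper name and structure are mine.
static boolean convertInMidEnd(int C) {
    long abs = Math.abs((long) C);                // widen so abs(MIN_VALUE) is exact
    if (abs <= 1) return false;                   // 0 and +/-1 are handled trivially elsewhere
    if (Long.bitCount(abs) == 1) return true;     // pattern 1:  |C| = 1<<n       (n > 0)
    if (Long.bitCount(abs + 1) == 1) return true; // pattern 2:  |C| = (1<<n) - 1 (n > 0)
    return C > 0 && Long.bitCount(abs) == 2 && (abs & 1) == 1;
                                                  // pattern 3.1: C = (1<<n) + 1  (n > 0);
                                                  // 3.2, 3.3 and 3.4 are disabled here
}
```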
In theory, auto-vectorization should be able to generate the best
vectorized code, and cases that cannot be vectorized would be converted
into more optimal scalar instructions in the architecture back-end (this
is what gcc and llvm do). However, we currently have neither a cost model
nor vplan, and the results of auto-vectorization are significantly
affected by its input. Therefore, this patch turns off patterns 3.2, 3.3
and 3.4 in the mid-end; where profitable, these patterns can then be
implemented in the back-end. If we implement a cost model and vplan in the
future, it would be best to move all patterns to the back-end; this patch
does not conflict with that direction.
I also tested this patch on Arm N1, Intel SPR and AMD Genoa machines; no
noticeable performance degradation was seen on any of them. Here are the
test results on an Arm V2 and an AMD Genoa machine:
```
Benchmark V2-now V2-after V2-uplift Genoa-now Genoa-after Genoa-uplift Pattern Notes
testInt8 60.36989 60.276736 1 116.768294 116.772547 0.99 1
testInt8AddSum 63.658064 63.797732 0.99 16.04973 16.051491 0.99 1
testInt8Store 38.829618 39.054129 0.99 19.857453 20.006321 0.99 1
testIntN8 59.99655 60.150053 0.99 132.269926 132.252473 1 1
testIntN8AddSum 145.678098 146.181549 0.99 158.546226 158.806476 0.99 1
testIntN8Store 32.802445 32.897907 0.99 19.047873 19.065941 0.99 1
testInt7 98.978213 99.176574 0.99 114.07026 113.08989 1 2
testInt7AddSum 62.675636 62.310799 1 23.370851 20.971655 1.11 2
testInt7Store 32.850828 32.923315 0.99 23.884952 23.628681 1.01 2
testIntN7 60.27949 60.668158 0.99 174.224893 174.102295 1 2
testIntN7AddSum 62.746696 62.288476 1 20.93192 20.964557 0.99 2
testIntN7Store 32.812906 32.851355 0.99 23.810024 23.526074 1.01 2
testInt9 60.820402 60.331938 1 108.850777 108.846161 1 3.1
testInt9AddSum 62.24679 62.374637 0.99 20.698749 20.741137 0.99 3.1
testInt9Store 32.871723 32.912065 0.99 19.055537 19.080735 0.99 3.1
testIntN33 106.517618 103.450746 1.02 153.894345 140.641135 1.09 3.2 n > 4
testIntN33AddSum 147.589815 47.911612 3.08 153.851885 17.008453 9.04 3.2
testIntN33Store 75.434513 43.473053 1.73 26.612181 20.436323 1.3 3.2
testIntN9 102.173268 103.70682 0.98 155.858169 140.718967 1.1 3.2 n <= 4
testIntN9AddSum 148.724952 47.963305 3.1 186.902111 20.249414 9.23 3.2
testIntN9Store 74.783788 43.339188 1.72 20.150159 20.888448 0.96 3.2
testInt18 98.905625 102.942092 0.96 142.480636 140.748778 1.01 3.3 m <= 4, n <= 4
testInt18AddSum 68.695585 48.103536 1.42 26.88524 16.77886 1.6 3.3
testInt18Store 41.307909 43.385183 0.95 21.233238 20.875026 1.01 3.3
testInt36 99.039742 103.714745 0.95 142.265806 142.334039 0.99 3.3 m > 4, n <= 4
testInt36AddSum 68.736756 47.952189 1.43 26.868362 17.030035 1.57 3.3
testInt36Store 41.403698 43.414093 0.95 21.225454 20.52266 1.03 3.3
testInt96 105.00287 103.528144 1.01 237.649526 140.643255 1.68 3.3 m > 4, n > 4
testInt96AddSum 68.481133 48.04549 1.42 26.877407 16.918209 1.58 3.3
testInt96Store 41.276292 43.512994 0.94 23.456117 20.540181 1.14 3.3
testIntN18 138.629044 103.269657 1.34 210.315628 140.716818 1.49 3.4 m <= 4, n <= 4
testIntN18AddSum 156.635652 48.003989 3.26 215.807135 16.917665 12.75 3.4
testIntN18Store 57.584487 43.410415 1.32 26.819827 20.707778 1.29 3.4
testIntN36 139.068861 103.766774 1.34 209.522432 140.720322 1.48 3.4 m > 4, n <= 4
testIntN36AddSum 156.36928 48.027779 3.25 215.705842 16.893192 12.76 3.4
testIntN36Store 57.715418 43.493958 1.32 21.651252 20.676877 1.04 3.4
testIntN96 139.151761 103.453665 1.34 269.254161 140.753499 1.91 3.4 m > 4, n > 4
testIntN96AddSum 153.123557 48.110524 3.18 263.262635 17.011144 15.47 3.4
testIntN96Store 57.793179 43.47574 1.32 24.444592 20.530219 1.19 3.4
```
Limitations:
1. This patch only analyzes two vector cases; there may be other vector
cases that regress with this patch.
2. This patch does not implement the disabled patterns in the back-end; I
will propose a follow-up patch to implement them in the aarch64 back-end.
3. This patch does not handle the long type, because different
architectures have different auto-vectorization support for long,
resulting in very different performance, and it is difficult to find a
solution that does not introduce significant performance degradation.