[AArch64] Runtime-unroll small load/store loops for Apple Silicon CPUs. (#118317)
Add initial heuristics to selectively enable runtime unrolling for loops
where doing so is expected to be highly beneficial on Apple Silicon
CPUs.
To start with, we try to runtime-unroll small, single block loops, if
they have load/store dependencies, to expose more parallel memory
access streams [1] and to improve instruction delivery [2].
We also explicitly avoid runtime-unrolling for loop structures that may
limit the expected gains from runtime unrolling. Such loops include
loops with complex control flow (aren't innermost loops, have multiple
exits, have a large number of blocks), trip count expansion is
expensive and are expected to execute a small number of iterations.
Note that the heuristics here may be overly conservative and we err on
the side of avoiding runtime unrolling rather than unroll excessively.
They are all subject to further refinement.
Across a large set of workloads, this increase the total number of
unrolled loops by 2.9%.
[1] 4.6.10 in Apple Silicon CPU Optimization Guide
[2] 4.4.4 in Apple Silicon CPU Optimization Guide
Depends on https://github.com/llvm/llvm-project/pull/118316 for TTI
changes.
PR: https://github.com/llvm/llvm-project/pull/118317