llvm-project
0bb7bd4b - [AArch64] Runtime-unroll small load/store loops for Apple Silicon CPUs. (#118317)

Commit
332 days ago
Add initial heuristics to selectively enable runtime unrolling for loops where doing so is expected to be highly beneficial on Apple Silicon CPUs.

To start with, we try to runtime-unroll small, single-block loops that have load/store dependencies, to expose more parallel memory access streams [1] and to improve instruction delivery [2].

We also explicitly avoid runtime unrolling for loop structures that may limit the expected gains. These include loops with complex control flow (loops that aren't innermost, have multiple exits, or have a large number of blocks), loops whose trip count expansion is expensive, and loops that are expected to execute only a small number of iterations.

Note that the heuristics here may be overly conservative: we err on the side of avoiding runtime unrolling rather than unrolling excessively. They are all subject to further refinement.

Across a large set of workloads, this increases the total number of unrolled loops by 2.9%.

[1] 4.6.10 in Apple Silicon CPU Optimization Guide
[2] 4.4.4 in Apple Silicon CPU Optimization Guide

Depends on https://github.com/llvm/llvm-project/pull/118316 for TTI changes.

PR: https://github.com/llvm/llvm-project/pull/118317