[AArch64] Improve codegen of vectorised early exit loops (#119534)
Once PR #112138 lands we are able to start vectorising more loops
that have uncountable early exits. The typical loop structure
looks like this:
vector.body:
...
%pred = icmp eq <2 x ptr> %wide.load, %broadcast.splat
...
%or.reduc = tail call i1 @llvm.vector.reduce.or.v2i1(<2 x i1> %pred)
%iv.cmp = icmp eq i64 %index.next, 4
%exit.cond = or i1 %or.reduc, %iv.cmp
br i1 %exit.cond, label %middle.split, label %vector.body
middle.split:
br i1 %or.reduc, label %found, label %notfound
found:
ret i64 1
notfound:
ret i64 0
The problem with this is that %or.reduc is kept live after the loop,
and since this is a boolean it typically requires making a copy of
the condition code register. For AArch64 this requires an additional
cset instruction, which is quite expensive for a typical find loop
that only contains 6 or 7 instructions.
This patch attempts to improve the codegen by sinking the reduction
out of the loop to the location of it's user. It's a lot cheaper to
keep the predicate alive if the type is legal and has lots of
registers for it. There is a potential downside in that a little
more work is required after the loop, but I believe this is worth
it since we are likely to spend most of our time in the loop.