[LV] Vectorize early exit loops with stores using masking (#178454)

This is an alternative approach to vectorizing early exit loops with
stores that avoids the need for an extra check block. It is fairly
straightforward and should work on any vector ISA that supports masked
memory ops.

The basic approach is to create a mask covering all lanes _before_ the
first exiting lane, using cttz.elts and active.lane.mask (which sets all
lanes to true if the uncountable exit wasn't taken, since cttz.elts then
returns the number of elements). If the uncountable exit was taken, then
there is still one scalar iteration left to perform after the vector
loop; that iteration also determines which exit block we should branch
to.

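
As a rough scalar model of that mask construction, the sketch below mimics the two operations in plain C. The helper names `cttz_elts` and `active_lane_mask` and the fixed `VF` of 4 are assumptions made for illustration; they are not the LLVM intrinsics themselves, only a lane-by-lane approximation of the behaviour described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define VF 4 /* illustrative vector width, an assumption for this sketch */

/* Models cttz.elts: index of the first true lane, or VF if no lane is true. */
static size_t cttz_elts(const bool cmp[VF]) {
    assert(cmp != NULL);
    for (size_t lane = 0; lane < VF; ++lane)
        if (cmp[lane])
            return lane;
    return VF;
}

/* Models get.active.lane.mask(base, n): lane L is active when base + L < n. */
static void active_lane_mask(size_t base, size_t n, bool mask[VF]) {
    assert(mask != NULL);
    for (size_t lane = 0; lane < VF; ++lane)
        mask[lane] = (base + lane) < n;
}
```

Calling `active_lane_mask(0, cttz_elts(cmp), mask)` with no true lane in `cmp` gives an all-true mask in this model, which matches the "exit not taken" case.
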
We no longer need to advance exit conditions in the vector body to the
next iteration (compared to the other PR), though we still need to move
the recipes that generate the exit condition ahead of the masked
operations (depending on which memory operations come first in the
loop).

The advantage this has over a full in-loop mask approach is that we
don't need to form intermediate masks for each uncountable exit. I
haven't tried to mix this with the ongoing multiple-exit work yet, but
we should be able to handle multiple exits without increasing the amount
of generated per-exit code. We also won't need to unpick which exit
condition was met first.

For a pseudo-C example of the transformation (with S1 and S2
representing statements with a side effect, like stores, or possibly a
load that may fault if continued past the early exit), given the
following scalar loop:
```c
for (i = 0; i < N; ++i) {
  S1;
  if (a[i] == threshold)
    break;
  S2;
}
```
we would have a vector loop and scalar tail like the following:
```c
int i = 0;
for (; i < vecN; i += VF) {
  // Move load for uncountable exit condition before other
  // operations in the loop.
  vecA = a[i]...a[i+VF-1];
  // Create mask for all lanes _before_ any uncountable exit.
  vecCmp = vecA == splat(threshold);
  mask = get.active.lane.mask(0, cttz.elts(vecCmp));
  // Execute statements with side effects using the mask
  vecS1(mask);
  vecS2(mask);
  // If there was an uncountable exit, increase IV by the number
  // of elements in the mask, and bail out to the scalar tail.
  if (any_of(vecCmp)) {
    i += cttz.elts(vecCmp);
    break;
  }
}
// Scalar tail handles remaining iterations, plus any differences
// in exit block for different exits.
for (; i < N; ++i) {
  S1;
  if (a[i] == threshold)
    break;
  S2;
}
```
For the mask, given a comparison result of `<0, 0, 1, 0>`, we would
expect a mask of `<1, 1, 0, 0>`.
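
To sanity-check the transformation as a whole, here is a small C emulation that can be compared against the scalar loop. Everything concrete in it is an assumption for illustration: `S1` and `S2` are stood in for by simple stores into two hypothetical output arrays, `VF` is fixed at 4, and the "vector" body is modelled lane by lane rather than with real vector instructions.

```c
#include <assert.h>
#include <stddef.h>

#define VF 4 /* illustrative vector width (assumption) */

/* Reference scalar loop. S1 and S2 are stand-in stores (assumptions). */
static size_t scalar_loop(const int *a, size_t n, int threshold,
                          int *s1, int *s2) {
    size_t i = 0;
    for (; i < n; ++i) {
        s1[i] = a[i] + 1; /* S1 */
        if (a[i] == threshold)
            break;
        s2[i] = a[i] * 2; /* S2 */
    }
    return i; /* index at which the loop stopped */
}

/* Lane-by-lane emulation of the masked vector loop plus scalar tail. */
static size_t masked_loop(const int *a, size_t n, int threshold,
                          int *s1, int *s2) {
    assert(a != NULL && s1 != NULL && s2 != NULL);
    size_t vec_n = n - (n % VF); /* iterations covered by full vectors */
    size_t i = 0;
    for (; i < vec_n; i += VF) {
        /* Models cttz.elts(vecCmp): first exiting lane, or VF if none. */
        size_t first_exit = 0;
        while (first_exit < VF && a[i + first_exit] != threshold)
            ++first_exit;
        /* Masked S1 and S2: only lanes strictly before the exiting lane. */
        for (size_t lane = 0; lane < first_exit; ++lane) {
            s1[i + lane] = a[i + lane] + 1;
            s2[i + lane] = a[i + lane] * 2;
        }
        /* Exit taken: advance to the exiting lane, go to the scalar tail. */
        if (first_exit < VF) {
            i += first_exit;
            break;
        }
    }
    /* Scalar tail: the exiting iteration and any remainder iterations. */
    for (; i < n; ++i) {
        s1[i] = a[i] + 1;
        if (a[i] == threshold)
            break;
        s2[i] = a[i] * 2;
    }
    return i;
}
```

In this model both functions should leave the output arrays in the same state and stop at the same index, because the exiting iteration's `S1` is always run by the scalar tail and never by the masked body.
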