[AArch64] recognise trn1/trn2 with flipped operands (#169858)
This PR is very similar to #167235, but applied to `trn` rather than
`zip`. There are two further differences:
- The `@combine_v8i16_8first` and `@combine_v8i16_8firstundef` test
cases in `arm64-zip.ll` didn't have equivalents in `arm64-trn.ll`, so
this PR adds new test cases `@vtrni8_8first`, `@vtrni8_9first`,
`@vtrni8_89first_undef`.
- `AArch64TTIImpl::getShuffleCost` calls `isZIPMask`, but not
`isTRNMask`. It relies on `Kind == TTI::SK_Transpose` instead (which
in turn is based on `ShuffleVectorInst::isTransposeMask` through
`improveShuffleKindFromMask`).
Therefore, this PR does not itself influence the slp-vectorizer. In a
follow-up PR, I intend to override
`AArch64TTIImpl::improveShuffleKindFromMask` to ensure we get
`ShuffleKind::SK_Transpose` based on the new `isTRNMask`. In fact, that
follow-up change is the actual motivation for this PR, as it will result
in
```C++
int8x16_t g(int8_t x)
{
  return (int8x16_t) { 0, x, 1, x, 2, x, 3, x,
                       4, x, 5, x, 6, x, 7, x };
}
```
from #137447 being optimised by the slp-vectorizer.
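For readers unfamiliar with the mask shape, here is a rough, standalone sketch of what "trn1/trn2 with flipped operands" means at the shuffle-mask level. It is not the code in this PR and does not use LLVM's actual `isTRNMask` signature. `TrnMatch` and `matchTrnMaskSketch` are hypothetical names invented for illustration. The sketch assumes mask indices refer to the concatenation of the two source vectors and that `-1` denotes an undef lane.

```C++
#include <initializer_list>
#include <optional>
#include <vector>

// Hypothetical result type for this sketch; not an LLVM type.
struct TrnMatch {
  unsigned WhichResult;  // 0 for trn1, 1 for trn2
  bool OperandsFlipped;  // true if the second source feeds the even lanes
};

// Illustrative only. Checks whether Mask selects a transpose of two vectors
// with Mask.size() elements each. Indices 0..N-1 refer to the first source,
// N..2N-1 to the second, and -1 (assumed to mean undef) matches anything.
std::optional<TrnMatch> matchTrnMaskSketch(const std::vector<int> &Mask) {
  const int NumElts = static_cast<int>(Mask.size());
  if (NumElts == 0 || NumElts % 2 != 0)
    return std::nullopt;

  for (unsigned Which = 0; Which < 2; ++Which) {
    for (bool Flipped : {false, true}) {
      // Without flipping, even result lanes come from the first source and
      // odd result lanes from the second; flipping swaps those roles.
      const int EvenBase = Flipped ? NumElts : 0;
      const int OddBase = Flipped ? 0 : NumElts;
      bool Ok = true;
      for (int I = 0; I < NumElts && Ok; I += 2) {
        // trn1 reads source lane I, trn2 reads source lane I + 1.
        const int Lane = I + static_cast<int>(Which);
        if (Mask[I] >= 0 && Mask[I] != EvenBase + Lane)
          Ok = false;
        if (Mask[I + 1] >= 0 && Mask[I + 1] != OddBase + Lane)
          Ok = false;
      }
      if (Ok)
        return TrnMatch{Which, Flipped};
    }
  }
  return std::nullopt;
}
```

Under these assumptions, the 8-lane mask `<0, 8, 2, 10, 4, 12, 6, 14>` matches trn1 with operands in order, while `<8, 0, 10, 2, 12, 4, 14, 6>` matches trn1 with the operands flipped. A zip-style mask such as `<0, 8, 1, 9, 2, 10, 3, 11>` is rejected.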