[AArch64] Fold scalar-to-vector shuffles into DUP/FMOV (#166962)
Previously, LLVM emitted an inefficient instruction sequence when the
low 64 bits of a 128-bit vector were filled with a scalar and the high
64 bits were set to zero. This patch uses fmov/dup to write the scalar
into the low lanes. Both instructions write the 64-bit form of the
register, so the high 64 bits are zeroed implicitly.
E.g. in the worst case,
```
#include <arm_neon.h>

int8x16_t foo_s8(int8_t a) {
  int8x16_t b = vcombine_s8(vdup_n_s8(a), vdup_n_s8(0));
  return b;
}
```
LLVM would emit:
```
foo_s8(signed char):
        movi    v0.2d, #0000000000000000
        mov     v0.b[0], w0
        mov     v0.b[1], w0
        mov     v0.b[2], w0
        mov     v0.b[3], w0
        mov     v0.b[4], w0
        mov     v0.b[5], w0
        mov     v0.b[6], w0
        mov     v0.b[7], w0
        ret
```
With this patch, LLVM emits a single instruction per case:
- <2 x i64> from i64 -> fmov d0, x0
- <4 x i32> from i32 -> dup v0.2s, w0
- <8 x i16> from i16 -> dup v0.4h, w0
- <16 x i8> from i8 -> dup v0.8b, w0
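
To illustrate why the single dup is equivalent to the old sequence for
the i8 case, here is a minimal portable C sketch that models the
register as 16 byte lanes. It is not code from this patch: the type
`vreg128` and the functions `model_dup_8b`, `model_movi_ins` and
`sequences_agree` are hypothetical names invented for illustration. The
model relies on the fact that a write to the 64-bit form of a vector
register clears bits 127:64.
```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit vector register as 16 byte lanes. */
typedef struct {
  uint8_t b[16];
} vreg128;

/* Models `dup v0.8b, w0`: the low 8 byte lanes receive the scalar.
   Writing the 64-bit form of the register clears the upper 64 bits. */
vreg128 model_dup_8b(uint8_t a) {
  vreg128 v;
  memset(v.b, 0, sizeof v.b); /* upper half zeroed by the 64-bit write */
  memset(v.b, a, 8);          /* lanes 0..7 = a */
  return v;
}

/* Models the old sequence: `movi v0.2d, #0` followed by eight
   `mov v0.b[i], w0` lane inserts. */
vreg128 model_movi_ins(uint8_t a) {
  vreg128 v;
  memset(v.b, 0, sizeof v.b); /* movi v0.2d, #0 */
  for (int i = 0; i < 8; ++i)
    v.b[i] = a;               /* mov v0.b[i], w0 */
  return v;
}

/* Returns 1 if both models produce the same register contents for
   every possible 8-bit scalar, 0 otherwise. */
int sequences_agree(void) {
  for (int a = 0; a < 256; ++a) {
    vreg128 x = model_dup_8b((uint8_t)a);
    vreg128 y = model_movi_ins((uint8_t)a);
    if (memcmp(x.b, y.b, sizeof x.b) != 0)
      return 0;
  }
  return 1;
}
```
Calling `sequences_agree()` compares the two models over all 256 input
values. The wider element types should follow the same reasoning with
fewer, larger lanes.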