[AArch64] Decompose FADD reductions with known zero elements (#167313)
FADDV is matched into FADDPv4f32 + FADDPv2i32p, but this can be relaxed
when one or more elements (usually the 4th) are known to be zero.
Before:
```
movi d1, #0000000000000000
mov v0.s[3], v1.s[0]
faddp v0.4s, v0.4s, v0.4s
faddp s0, v0.2s
```
After:
```
mov s1, v0.s[2]
faddp s0, v0.2s
fadd s0, s0, s1
```
When all of the elements are zero, the intrinsic now simply folds to
a constant instead of emitting two additions.
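As a rough illustration of why the two sequences above compute the same
value, here is a small Python model of both. It is only a sketch: the
helper names (f32, fadd, reduce_before, reduce_after) are hypothetical,
single-precision rounding is emulated with struct, and the 4th lane is
assumed to be +0.0. Signed-zero corner cases and whatever fast-math
flags the combine requires are not modelled here.

```python
import struct


def f32(x):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fadd(a, b):
    """Single-precision add: compute in double, then round to float32."""
    return f32(a + b)


def reduce_before(v):
    """Model of the old sequence: zero lane 3, then two pairwise adds."""
    lanes = [f32(v[0]), f32(v[1]), f32(v[2]), 0.0]  # mov v0.s[3], zero
    lo = fadd(lanes[0], lanes[1])                   # faddp v0.4s (low pair)
    hi = fadd(lanes[2], lanes[3])                   # faddp v0.4s (high pair)
    return fadd(lo, hi)                             # faddp s0, v0.2s


def reduce_after(v):
    """Model of the new sequence: one pairwise add plus one scalar add."""
    s1 = f32(v[2])                                  # mov s1, v0.s[2]
    s0 = fadd(f32(v[0]), f32(v[1]))                 # faddp s0, v0.2s
    return fadd(s0, s1)                             # fadd s0, s0, s1


if __name__ == "__main__":
    for sample in ([1.5, 2.25, -0.75], [0.1, 0.2, 0.3], [1e10, -1e10, 3.0]):
        assert reduce_before(sample) == reduce_after(sample)
    print(reduce_after([1.5, 2.25, -0.75]))  # prints 3.0
```

Adding +0.0 to lane 2 leaves any nonzero value unchanged, so the high
pairwise add in the old sequence is redundant and the association order
(lane0 + lane1) + lane2 is preserved by the new one.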