[AArch64] Support lowering smaller than legal LOOP_DEP_MASKs to whilewr/rw (#171982)
This adds support for lowering smaller-than-legal masks such as:
```
<vscale x 8 x i1> @llvm.loop.dependence.war.mask.nxv8i1(ptr %a, ptr %b, i64 1)
```
to a whilewr + unpack. It also slightly simplifies the lowering.