[AArch64][SME] Improve codegen for aarch64.sme.cnts* when not in streaming mode (#154761)
Builtins for reading the streaming vector length are canonicalised to
use the aarch64.sme.cntsd intrinsic and a multiply, i.e.
- cntsb -> cntsd * 8
- cntsh -> cntsd * 4
- cntsw -> cntsd * 2
This patch also removes the LLVM intrinsics for cnts[b,h,w], and adds
patterns to improve codegen when cntsd is multiplied by a constant.
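
For reference, the arithmetic behind the canonicalisation can be sketched
in plain C. This is only an illustrative model, not code from the patch:
the model_* helpers and the svl_bits parameter are hypothetical names, and
the sketch assumes cntsd yields the number of 64-bit elements in a
streaming vector of svl_bits bits.

```c
#include <stdint.h>

/* Hypothetical model (not part of the patch): number of 64-bit elements
   in a streaming vector that is svl_bits bits wide. */
static uint64_t model_cntsd(uint64_t svl_bits) { return svl_bits / 64; }

/* The other counts expressed as cntsd times a constant, mirroring the
   canonical form listed above. */
static uint64_t model_cntsw(uint64_t svl_bits) { return model_cntsd(svl_bits) * 2; } /* 32-bit elements */
static uint64_t model_cntsh(uint64_t svl_bits) { return model_cntsd(svl_bits) * 4; } /* 16-bit elements */
static uint64_t model_cntsb(uint64_t svl_bits) { return model_cntsd(svl_bits) * 8; } /* 8-bit elements  */
```

Because each count is a fixed multiple of cntsd, a single intrinsic plus a
constant multiply is enough to express all four.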