[AArch64] Use 0-cycle reg2reg MOVs for FPR32, FPR16, FPR8 (#144152)
This change emits zero-cycle register-to-register copy instructions for
the FPR32, FPR16, and FPR8 register classes on targets that support
them. The implementation follows the approach already used for GPR32.
It adds two regression tests, covering FPR32 and FPR16.
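
For illustration only, here is a minimal, self-contained sketch of the selection idea. It assumes the copy is performed on a wider view of the same register when the target reports a zero-cycle move for that wider class. The `RegClass` enum, the `hasZeroCycleWideMove` flag, and both functions are hypothetical names made up for this sketch; they are not LLVM's actual API and do not reproduce the code in this change.

```cpp
#include <string>

// Hypothetical model of the FPR register classes (not LLVM's types).
enum class RegClass { FPR8, FPR16, FPR32, FPR64 };

// Returns the register class in which a reg-to-reg copy would be performed.
// Assumption: narrow FPR classes are widened to FPR64 when the target has a
// zero-cycle move for the wider class; otherwise the class is left unchanged.
RegClass copyClassFor(RegClass cls, bool hasZeroCycleWideMove) {
  const bool isNarrow = cls == RegClass::FPR8 || cls == RegClass::FPR16 ||
                        cls == RegClass::FPR32;
  return (isNarrow && hasZeroCycleWideMove) ? RegClass::FPR64 : cls;
}

// Returns the assembly name of register number `idx` viewed in class `cls`
// (b = 8-bit, h = 16-bit, s = 32-bit, d = 64-bit view).
std::string regName(RegClass cls, unsigned idx) {
  char prefix = 'd';
  switch (cls) {
  case RegClass::FPR8:  prefix = 'b'; break;
  case RegClass::FPR16: prefix = 'h'; break;
  case RegClass::FPR32: prefix = 's'; break;
  case RegClass::FPR64: prefix = 'd'; break;
  }
  return std::string(1, prefix) + std::to_string(idx);
}
```

Under these assumptions, a copy between FPR32 registers 0 and 1 would be carried out on `d0` and `d1` when `hasZeroCycleWideMove` is true, and on `s0` and `s1` otherwise.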
Depends on https://github.com/llvm/llvm-project/pull/143680, which
resolves the test structure.