Use ARMv8 fcvt insns to speed up scalar fp16<->fp32 (#120012)
Thanks to a discussion with @mikekgfb, I've realized that the FP16_ARITH feature is available by default on Apple Silicon, so let's use it to speed up the portable but slow bit-manipulation algorithm implemented in `c10::detail::fp16_ieee_from_fp32_value` by using implicit conversion routines like the following:
```cpp
#include <cstdint>
#include <arm_neon.h>  // defines float16_t on AArch64

float sve_fp16_to_fp32_value(uint16_t h) {
  // Type-pun the raw fp16 bits into an IEEE half-precision value; the
  // implicit float16_t -> float conversion on return lets the compiler
  // emit a single widening fcvt instead of manual bit manipulation.
  union {
    uint16_t h;
    float16_t f16;
  } x = {h};
  return x.f16;
}
```
which, according to https://godbolt.org/z/8s14GvEjo, compiles down to a single [`fcvt s0, h0`](https://developer.arm.com/documentation/ddi0602/2023-12/SIMD-FP-Instructions/FCVT--Floating-point-Convert-precision--scalar--?lang=en)
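The opposite direction benefits the same way. A minimal sketch of the narrowing conversion (the function name here is illustrative, not the one used in the PR), which should compile to `fcvt h0, s0`:
```cpp
#include <cstdint>

// Illustrative helper: fp32 -> raw fp16 bits via the same union trick.
// On AArch64 the cast to __fp16 lowers to a single `fcvt h0, s0`.
uint16_t fp32_to_fp16_bits(float f) {
  union {
    __fp16 f16;
    uint16_t h;
  } x;
  x.f16 = static_cast<__fp16>(f);
  return x.h;
}
```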
As a result, the very slow and naive [`torch.mm`](https://github.com/pytorch/pytorch/blob/edd9ddf73fae023824c854f23abfe2c15bfcfeee/aten/src/ATen/native/cpu/BlasKernel.cpp#L108) runs 3x faster: from 85 ms to 27 ms (measured by running https://github.com/malfet/llm_experiments/blob/e41341df2d75395277d44e3ae770342fca4bdf18/benchmarks/benchmark_torch_mm.py ).
This is a reland of https://github.com/pytorch/pytorch/pull/119895, which was reverted because it did not build with the Jetson toolkit.
"Fixed" the problem by guarding the fast conversions with `!defined(__CUDACC__)` (for internal folks, tested it by running `buck build @arvr/mode/embedded/jetson/linux/opt-stripped //xplat/caffe2:caffe2_ops_cuda_ovrsource` )
But also, extended the conversion to all AArch64 platforms, not just the ones that support the FP16 arithmetic extensions (i.e. ARMv8.2), since the scalar fcvt between half and single precision is part of the base ARMv8 ISA.
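For illustration, a minimal sketch of what such guarding might look like; the guard condition follows the description above, while the function name is hypothetical and the fallback is only declared, not implemented:
```cpp
#include <cstdint>

#if defined(__aarch64__) && !defined(__CUDACC__)
// Fast path: scalar fcvt between half and single precision is part of
// the base ARMv8 ISA, so plain __fp16 works on any AArch64 target.
inline float fp16_bits_to_fp32(uint16_t h) {  // hypothetical name
  union {
    uint16_t u;
    __fp16 f;
  } x = {h};
  return x.f;
}
#else
// Portable bit-manipulation fallback (body omitted here);
// PyTorch's version lives in c10/util/Half.h.
float fp16_bits_to_fp32(uint16_t h);
#endif
```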
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120012
Approved by: https://github.com/huydhn