Use ARMV8.2 scalar fp16<->fp32 conversion (#119895)
Thanks to a discussion with @mikekgfb, I've realized that SVE is a
feature available by default on Apple Silicon, so let's use it to speed up
the portable but slow bit-mashing algorithm implemented as `c10::detail::fp16_ieee_from_fp32_value` by using the following implicit conversion routine:
```cpp
float sve_fp16_to_fp32_value(uint16_t h) {
  union {
    uint16_t h;
    float16_t f16;
  } x = {h};
  return x.f16;
}
```
which, according to https://godbolt.org/z/8s14GvEjo, compiles to [`fcvt s0,h0`](https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/FCVT--Floating-point-convert-precision--predicated--)
As a result, the very slow and naive [`torch.mm`](https://github.com/pytorch/pytorch/blob/edd9ddf73fae023824c854f23abfe2c15bfcfeee/aten/src/ATen/native/cpu/BlasKernel.cpp#L108) runs about 3x faster: from 85 msec to 27 msec (measured by running https://github.com/malfet/llm_experiments/blob/e41341df2d75395277d44e3ae770342fca4bdf18/benchmarks/benchmark_torch_mm.py )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119895
Approved by: https://github.com/mikekgfb
ghstack dependencies: #119892