pytorch
d833e2f2 - Use ARMV8.2 scalar fp16<->fp32 conversion (#119895)


Commit

220 days ago

Use ARMV8.2 scalar fp16<->fp32 conversion (#119895)

Thanks to a discussion with @mikekgfb, I've realized that SVE is the feature available by default on Apple Silicon, so let's use it to speed up the portable but slow bit-mashing algorithm implemented as `c10::detail::fp16_ieee_from_fp32_value` by using the following implicit conversion routine:

```cpp
float sve_fp16_to_fp32_value(uint16_t h) {
  union {
    uint16_t h;
    float16_t f16;
  } x = {h};
  return x.f16;
}
```

that, according to https://godbolt.org/z/8s14GvEjo, is turned into [`fcvt s0,h0`](https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/FCVT--Floating-point-convert-precision--predicated--)

As a result, the very slow and naive [`torch.mm`](https://github.com/pytorch/pytorch/blob/edd9ddf73fae023824c854f23abfe2c15bfcfeee/aten/src/ATen/native/cpu/BlasKernel.cpp#L108) runs 3x faster: from 85 msec before to 27 msec after (measured by running https://github.com/malfet/llm_experiments/blob/e41341df2d75395277d44e3ae770342fca4bdf18/benchmarks/benchmark_torch_mm.py )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119895
Approved by: https://github.com/mikekgfb
ghstack dependencies: #119892
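For readers who want to try the idea outside PyTorch, here is a minimal sketch of the fp16-to-fp32 direction with a portable fallback. It is not the code from this commit: the names `fp16_to_fp32` and `fp16_to_fp32_portable` are hypothetical, the `__aarch64__` guard assumes a GCC- or Clang-style toolchain where `<arm_neon.h>` provides `float16_t`, and `memcpy` is used in place of the union to avoid type punning. The fallback is a plain bit-manipulation conversion written for illustration, not a copy of the `c10::detail` routine.

```cpp
#include <cstdint>
#include <cstring>

#if defined(__aarch64__)
#include <arm_neon.h>  // assumed to provide float16_t on GCC/Clang
#endif

// Portable fallback: decode the IEEE 754 binary16 fields by hand.
inline float fp16_to_fp32_portable(uint16_t h) {
  const uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
  const uint32_t exp = (h >> 10) & 0x1Fu;
  uint32_t mant = h & 0x3FFu;
  uint32_t bits;
  if (exp == 0x1Fu) {
    // Infinity or NaN: all-ones exponent, payload shifted into place.
    bits = sign | 0x7F800000u | (mant << 13);
  } else if (exp != 0) {
    // Normal number: rebias the exponent from 15 to 127 (+112).
    bits = sign | ((exp + 112u) << 23) | (mant << 13);
  } else if (mant == 0) {
    // Signed zero.
    bits = sign;
  } else {
    // Subnormal: shift the mantissa until the implicit bit appears.
    uint32_t shift = 0;
    while ((mant & 0x400u) == 0) {
      mant <<= 1;
      ++shift;
    }
    mant &= 0x3FFu;
    bits = sign | ((113u - shift) << 23) | (mant << 13);
  }
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Fast path: on AArch64 the implicit float16_t -> float conversion is
// expected to compile to a single scalar fcvt instruction.
inline float fp16_to_fp32(uint16_t h) {
#if defined(__aarch64__)
  float16_t f16;
  std::memcpy(&f16, &h, sizeof(f16));
  return f16;
#else
  return fp16_to_fp32_portable(h);
#endif
}
```

Because every fp16 value is exactly representable in fp32, both paths should agree bit for bit on all non-NaN inputs, which makes the fallback a convenient reference when checking the fast path.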

Author

malfet

Committer

pytorchmergebot

Parents

096ebcca
