Use ARMv8 fcvt insns to speed up scalar fp16<->fp32 (#120012)
Thanks to a discussion with @mikekgfb, I've realized that the FP16_ARITH feature is available by default on Apple Silicon, so let's use it to speed up the portable but slow bit-manipulation algorithm implemented in `c10::detail::fp16_ieee_from_fp32_value` by using implicit conversion routines like the following:
```cpp
#include <cstdint>
#include <arm_neon.h>  // defines float16_t on AArch64

float sve_fp16_to_fp32_value(uint16_t h) {
  // Type-pun the raw fp16 bits into an IEEE half-precision value; the
  // implicit float16_t -> float conversion on return lets the compiler
  // emit a single widening fcvt instead of manual bit manipulation.
  union {
    uint16_t h;
    float16_t f16;
  } x = {h};
  return x.f16;
}
```
which, according to https://godbolt.org/z/8s14GvEjo, compiles down to a single [`fcvt s0, h0`](https://developer.arm.com/documentation/ddi0602/2023-12/SIMD-FP-Instructions/FCVT--Floating-point-Convert-precision--scalar--?lang=en)
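The opposite direction benefits the same way. A minimal sketch of the narrowing conversion (the function name here is illustrative, not the one used in the PR), which should compile to `fcvt h0, s0`:
```cpp
#include <cstdint>

// Illustrative helper: fp32 -> raw fp16 bits via the same union trick.
// On AArch64 the cast to __fp16 lowers to a single `fcvt h0, s0`.
uint16_t fp32_to_fp16_bits(float f) {
  union {
    __fp16 f16;
    uint16_t h;
  } x;
  x.f16 = static_cast<__fp16>(f);
  return x.h;
}
```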
As a result, the very slow and naive [`torch.mm`](https://github.com/pytorch/pytorch/blob/edd9ddf73fae023824c854f23abfe2c15bfcfeee/aten/src/ATen/native/cpu/BlasKernel.cpp#L108) runs 3x faster: from 85 ms to 27 ms (measured by running https://github.com/malfet/llm_experiments/blob/e41341df2d75395277d44e3ae770342fca4bdf18/benchmarks/benchmark_torch_mm.py ).
This is a reland of https://github.com/pytorch/pytorch/pull/119895, which was reverted because it did not build with the Jetson toolkit.
"Fixed" the problem by guarding the fast conversions with `!defined(__CUDACC__)` (for internal folks, tested it by running `buck build @arvr/mode/embedded/jetson/linux/opt-stripped //xplat/caffe2:caffe2_ops_cuda_ovrsource` )
But also, extended the conversion to all AArch64 platforms, not just the ones that support the FP16 arithmetic extensions (i.e. ARMv8.2), since the scalar fcvt between half and single precision is part of the base ARMv8 ISA.
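For illustration, a minimal sketch of what such guarding might look like; the guard condition follows the description above, while the function name is hypothetical and the fallback is only declared, not implemented:
```cpp
#include <cstdint>

#if defined(__aarch64__) && !defined(__CUDACC__)
// Fast path: scalar fcvt between half and single precision is part of
// the base ARMv8 ISA, so plain __fp16 works on any AArch64 target.
inline float fp16_bits_to_fp32(uint16_t h) {  // hypothetical name
  union {
    uint16_t u;
    __fp16 f;
  } x = {h};
  return x.f;
}
#else
// Portable bit-manipulation fallback (body omitted here);
// PyTorch's version lives in c10/util/Half.h.
float fp16_bits_to_fp32(uint16_t h);
#endif
```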
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120012
Approved by: https://github.com/huydhn