e2900461 - Add NEON accelerated torch.mv kernel (#119992)

Add NEON accelerated torch.mv kernel (#119992)

This reduces `torch.mv` time for a 256x768 matrix by a 256-element vector from 209 usec to 16 usec in the non-transposed case, and from 104 usec to 18 usec in the transposed case.

Also, add an fp16-accumulation flavor to the same ops (controlled by the private `torch._C._set_cpu_allow_fp16_reduced_precision_reduction`), which yields slightly better numbers, summarized in the following table:

| op | original | F32+NEON | F16+NEON |
| --- | --- | --- | --- |
| torch.mv(m, v) | 209.53 usec | 16.25 usec | 14.68 usec |
| torch.mv(m.t(), v) | 104.80 usec | 28.68 usec | 24.82 usec |

Test plan: CI on MacOS for both CPU and MPS; test fp32<->fp16 matmul consistency (for example, "test_output_grad_match_nn_functional_linear_cpu_float16" passes if fp32 reductions are performed, but fails if fp16 accumulation is used).

To investigate:
- why replacing `sum0Vec = vaddq_f32(sum0Vec, vmulq_f32(a0Vec, xVec));` with `sum0Vec = vfmaq_f32(sum0Vec, a0Vec, xVec);` slows down gemv from 16.2 to 18.2 usec

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119992
Approved by: https://github.com/mikekgfb
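For illustration only, here is a minimal sketch of what a NEON fp32 gemv inner loop of this shape can look like: the function name `gemv_neon_f32`, the row-major layout, the unroll factor, and the size assumptions are mine, not taken from the actual kernel in this commit. It uses the separate multiply/add pattern (`vaddq_f32` + `vmulq_f32`) that the commit message reports as faster than `vfmaq_f32` on the tested hardware.

```cpp
#include <arm_neon.h>
#include <cstdint>

// Illustrative sketch (not the PyTorch kernel from this commit):
// y = A * x for a row-major m-by-n fp32 matrix A, using four NEON
// accumulators per row. Assumes n is a multiple of 16 for brevity.
void gemv_neon_f32(const float* A, const float* x, float* y,
                   int64_t m, int64_t n) {
  for (int64_t i = 0; i < m; ++i) {
    const float* row = A + i * n;
    float32x4_t sum0 = vdupq_n_f32(0.f);
    float32x4_t sum1 = vdupq_n_f32(0.f);
    float32x4_t sum2 = vdupq_n_f32(0.f);
    float32x4_t sum3 = vdupq_n_f32(0.f);
    for (int64_t j = 0; j < n; j += 16) {
      // Separate multiply and add, as in the commit's faster variant;
      // vfmaq_f32(sum0, a0, x0) would fuse them but measured slower here.
      float32x4_t a0 = vld1q_f32(row + j);
      float32x4_t a1 = vld1q_f32(row + j + 4);
      float32x4_t a2 = vld1q_f32(row + j + 8);
      float32x4_t a3 = vld1q_f32(row + j + 12);
      sum0 = vaddq_f32(sum0, vmulq_f32(a0, vld1q_f32(x + j)));
      sum1 = vaddq_f32(sum1, vmulq_f32(a1, vld1q_f32(x + j + 4)));
      sum2 = vaddq_f32(sum2, vmulq_f32(a2, vld1q_f32(x + j + 8)));
      sum3 = vaddq_f32(sum3, vmulq_f32(a3, vld1q_f32(x + j + 12)));
    }
    // Horizontal reduction of the four accumulators into one scalar.
    y[i] = vaddvq_f32(vaddq_f32(vaddq_f32(sum0, sum1),
                                vaddq_f32(sum2, sum3)));
  }
}
```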