onnxruntime
af4bf436 - Replace vmlaq_f32 with vfmaq_f32 (fused multiply-add) (#25669)

Commit

322 days ago

Replace vmlaq_f32 with vfmaq_f32 (fused multiply-add) (#25669)

### Description

The [vfmaq_f32](https://developer.arm.com/architectures/instruction-sets/intrinsics/vfmaq_f32) intrinsic compiles to the [FMLA](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/FMLA--vector---Floating-point-fused-Multiply-Add-to-accumulator--vector--?lang=en) instruction, which is more performant than the separate `fmul` + `fadd` instructions that [vmlaq_f32](https://developer.arm.com/architectures/instruction-sets/intrinsics/vmlaq_f32) compiles to on the latest GCC versions: https://godbolt.org/z/aYc9as5Wh

Note that this is not a breaking change, as vmlaq_f32 already compiles to FMLA instructions on the latest clang compilers (which are the default for macOS ORT builds).

### Motivation and Context

With this change, the NEON version of `MlasMultiplyAddFloat32x4` achieves parity with the x86 version that uses `_mm_fmadd_ps`. It also achieves speedups of up to ~15% compared to the current `vmlaq_f32` implementation when tested on top of #25580.
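For readers unfamiliar with the distinction, the following is a minimal sketch of a fused versus an unfused multiply-add over four floats. It is not the MLAS implementation: the helper names `MultiplyAddFused4` and `MultiplyAddUnfused4` are hypothetical. The NEON path is only compiled on AArch64, and elsewhere a scalar `std::fma` fallback stands in for it. The unfused variant rounds the product to `float` before the add, mimicking a separate `fmul` + `fadd` pair.

```cpp
#include <cassert>
#include <cmath>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper (not the MLAS API): out[i] = acc[i] + a[i] * b[i],
// computed with a single rounding per lane (fused multiply-add).
inline void MultiplyAddFused4(const float* a, const float* b,
                              const float* acc, float* out) {
    assert(a && b && acc && out);
#if defined(__aarch64__)
    // vfmaq_f32(x, y, z) computes x + y * z per lane.
    vst1q_f32(out, vfmaq_f32(vld1q_f32(acc), vld1q_f32(a), vld1q_f32(b)));
#else
    // Portable fallback so the sketch also builds on non-ARM hosts.
    for (int i = 0; i < 4; ++i) {
        out[i] = std::fma(a[i], b[i], acc[i]);
    }
#endif
}

// Hypothetical helper: the same operation with two roundings per lane.
// The volatile temporary forces the product to be rounded to float and
// keeps the compiler from contracting the expression into an FMA.
inline void MultiplyAddUnfused4(const float* a, const float* b,
                                const float* acc, float* out) {
    assert(a && b && acc && out);
    for (int i = 0; i < 4; ++i) {
        volatile float product = a[i] * b[i];
        out[i] = product + acc[i];
    }
}
```

Besides saving an instruction, the fused form rounds only once, so results can differ from the unfused form in the last bits. With `a = b = 1 + 2^-12` and `acc = -(1 + 2^-11)`, the fused result is `2^-24`, while the unfused result is `0` because the low bit of the product is lost when it is rounded to `float`.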

References

#25669 - Replace vmlaq_f32 with vfmaq_f32 (fused multiply-add)

Author

Rohanjames1997

Parents

5746ba9d
