Replace vmlaq_f32 with vfmaq_f32 (fused multiply-add) (#25669)
### Description
The
[vfmaq_f32](https://developer.arm.com/architectures/instruction-sets/intrinsics/vfmaq_f32)
intrinsic compiles to the
[FMLA](https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/FMLA--vector---Floating-point-fused-Multiply-Add-to-accumulator--vector--?lang=en)
instruction, which is faster than the separate `fmul` + `fadd`
instructions that
[vmlaq_f32](https://developer.arm.com/architectures/instruction-sets/intrinsics/vmlaq_f32)
compiles to on the latest GCC versions: https://godbolt.org/z/aYc9as5Wh
Note that this is not a breaking change: `vmlaq_f32` already compiles to
FMLA instructions on the latest Clang compilers (which are already the
default for macOS ORT builds).
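For illustration only, here is a minimal sketch of the two intrinsic forms side by side. The names `Float4`, `MultiplyAddFused`, and `MultiplyAddUnfused` are hypothetical and are not part of MLAS; the non-ARM fallback using `std::fma` is an assumption added so the sketch builds on any host. Both intrinsics take the accumulator as the first operand and compute `acc + x * y` per lane.

```cpp
#include <cmath>

#if defined(__aarch64__) || defined(_M_ARM64)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#endif

// Hypothetical 4-lane container used only by this sketch (not an MLAS type).
struct Float4 {
    float lane[4];
};

// Fused form: acc + x * y with a single rounding per lane.
// On AArch64 this uses vfmaq_f32, which maps to FMLA.
inline Float4 MultiplyAddFused(const Float4& x, const Float4& y, const Float4& acc) {
    Float4 out;
#if defined(SKETCH_HAVE_NEON)
    vst1q_f32(out.lane,
              vfmaq_f32(vld1q_f32(acc.lane), vld1q_f32(x.lane), vld1q_f32(y.lane)));
#else
    // Assumed portable stand-in: std::fma is also a single-rounding multiply-add.
    for (int i = 0; i < 4; ++i) {
        out.lane[i] = std::fma(x.lane[i], y.lane[i], acc.lane[i]);
    }
#endif
    return out;
}

// Previous form: vmlaq_f32. Whether this becomes fmul + fadd or a single
// FMLA depends on the compiler, as described above.
inline Float4 MultiplyAddUnfused(const Float4& x, const Float4& y, const Float4& acc) {
    Float4 out;
#if defined(SKETCH_HAVE_NEON)
    vst1q_f32(out.lane,
              vmlaq_f32(vld1q_f32(acc.lane), vld1q_f32(x.lane), vld1q_f32(y.lane)));
#else
    // Plain multiply then add; the compiler may or may not contract this.
    for (int i = 0; i < 4; ++i) {
        out.lane[i] = x.lane[i] * y.lane[i] + acc.lane[i];
    }
#endif
    return out;
}
```

The two helpers can differ in the last bit of a lane because the fused form rounds once instead of twice, which is the same behavior Clang builds already get from `vmlaq_f32`.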
### Motivation and Context
With this change, the NEON version of `MlasMultiplyAddFloat32x4`
achieves parity with the x86 version that uses `_mm_fmadd_ps`.
It also achieves speedups of up to ~15% over the current `vmlaq_f32`
implementation when tested on top of #25580.