[Inductor] Add support for NEON ISA in the Inductor C++ backend (#105590)
Fixes #104729
As suggested in the [blog](https://dev-discuss.pytorch.org/t/torchinductor-update-5-cpu-backend-backend-performance-update-and-deep-dive-on-key-optimizations/1117#:~:text=It%20can%20be,sub%2Dclasses.), I subclassed the `VecISA` class and implemented a NEON version of the `vec_reduce_all()` function, to go along with the existing AVX2 and AVX512 versions. Any operation that calls `vec_reduce_all()` will also take the NEON path and benefit from its vectorization.
`vec_reduce_all()` is invoked by Softmax and by other operations such as norms, so they all benefit from the vectorized path. Taking the fast path yields roughly 30% time savings for Softmax compared to the previously taken slow path.
| | Slow path | Fast path (NEON intrinsics) |
| -- | -- | -- |
| Softmax (100 passes, 1024 dimension) | 623.706 ms | 452.011 ms |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105590
Approved by: https://github.com/jgong5, https://github.com/malfet