Add NEON-accelerated implementation of ERF() (#105610)
Fixes #105493
Inspired by the [AVX implementation](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec256/vec256_float.h#L158-L189) of the same function.
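
For readers unfamiliar with the approach, here is a minimal scalar sketch of the kind of polynomial approximation involved. It assumes the linked AVX code follows the Abramowitz-Stegun formula 7.1.26; the names `erf_approx` and `gelu_approx` are hypothetical and are not part of this PR or of PyTorch. A vectorized version would apply the same arithmetic lane-wise.

```cpp
#include <cmath>

// Hypothetical scalar sketch of the Abramowitz-Stegun 7.1.26 approximation:
//   erf(x) ~= sign(x) * (1 - (a1*t + a2*t^2 + ... + a5*t^5) * exp(-x^2)),
//   with t = 1 / (1 + p*|x|).
// This is an illustration only, not the code added by this PR.
inline float erf_approx(float x) {
    constexpr float p  = 0.3275911f;
    constexpr float a1 = 0.254829592f;
    constexpr float a2 = -0.284496736f;
    constexpr float a3 = 1.421413741f;
    constexpr float a4 = -1.453152027f;
    constexpr float a5 = 1.061405429f;

    const float sign = std::copysign(1.0f, x);  // erf is odd, so work on |x|
    const float ax = std::fabs(x);
    const float t = 1.0f / (1.0f + p * ax);

    // Evaluate the degree-5 polynomial in t with Horner's scheme.
    float poly = a5;
    poly = poly * t + a4;
    poly = poly * t + a3;
    poly = poly * t + a2;
    poly = poly * t + a1;
    poly *= t;

    return sign * (1.0f - poly * std::exp(-ax * ax));
}

// Hypothetical helper: exact-form GELU built on the approximation above,
//   gelu(x) = 0.5 * x * (1 + erf(x / sqrt(2))).
inline float gelu_approx(float x) {
    constexpr float inv_sqrt2 = 0.70710678f;
    return 0.5f * x * (1.0f + erf_approx(x * inv_sqrt2));
}
```

The formula trades a small approximation error for a short, branch-free sequence of multiplies, adds, one division, and one `exp`, which is what makes it a good fit for SIMD lanes.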
Performance on a Graviton3 EC2 instance with a single OpenMP thread:
Operation | std math | SLEEF | NEON (this PR)
-- | -- | -- | --
GELU (100 passes) | 1141.897ms | 598.929ms | 515.499ms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105610
Approved by: https://github.com/jgong5