Fix overflow in quantize_val_arm (#60079)
Summary:
Use `__builtin_add_overflow` to detect integer overflow when `zero_point` is added to the rounded integral value.
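For illustration, the overflow check could be sketched roughly as follows. This is a minimal sketch, not the actual `quantize_val_arm` code: the helper name `add_zero_point_saturating` and its parameters are hypothetical, it takes an already-rounded `int32_t` as input, and saturating toward `qmin` on negative overflow is an assumption made here for completeness. `__builtin_add_overflow` is a GCC/Clang builtin.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of PyTorch): add zero_point to an
// already-rounded value and clamp the result to [qmin, qmax].
inline int32_t add_zero_point_saturating(int32_t rounded, int32_t zero_point,
                                         int32_t qmin, int32_t qmax) {
  assert(qmin <= qmax);
  int32_t r = 0;
  // __builtin_add_overflow returns true if the exact sum does not fit in r.
  if (__builtin_add_overflow(rounded, zero_point, &r)) {
    // Overflow is only possible when both operands share a sign, so the sign
    // of zero_point gives the saturation direction (assumption of this sketch).
    return zero_point > 0 ? qmax : qmin;
  }
  // No overflow: clamp to the quantized range as usual.
  return std::min(std::max(r, qmin), qmax);
}
```

With the `uint8_t` range `[0, 255]` and a `zero_point` of 1, a rounded value of `INT32_MAX` saturates to 255 instead of wrapping around to a negative number and being clamped to 0.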
Also fix a small typo.
After this PR, `python3 -c "import torch;print(torch.quantize_per_tensor(torch.ones(10) * 2**32, 0.5, 1, torch.quint8))"` returns the same vector of `127` on both x86_64 and aarch64 platforms.
This change merely mitigates the overflow bug. A more proper (and perhaps performance-impacting) fix would be to add `zero_point` to the floating-point values in both the serial and the vectorized code. Filed https://github.com/pytorch/pytorch/issues/61047 to track this.
Also filed https://github.com/pytorch/pytorch/issues/61046 to clarify the intended use of the `__ARM_NEON__` define.
Fixes https://github.com/pytorch/pytorch/issues/60077
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60079
Reviewed By: kimishpatel
Differential Revision: D29157883
Pulled By: malfet
fbshipit-source-id: 6f75d93e6d3d4d0d5a5eab545cb27773086b9768