[pt][quant] Make the CUDA fake quantize logic consistent with CPU fake quantize logic (#49808)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49808
PyTorch's CPU quantization computes `dst = std::nearbyint(src * inv_scale) + zero_point` rather than the legacy `dst = std::nearbyint(src * inv_scale + zero_point)`. The CUDA fake quantize implementation does not match this. This diff makes the CPU and CUDA implementations consistent.
- FBGEMM code pointer: https://github.com/pytorch/FBGEMM/blob/master/include/fbgemm/QuantUtils.h#L76-L80
- PyTorch code pointer:
https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/affine_quantizer.cpp#L306
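To illustrate why the two formulas are not interchangeable, here is a minimal C++ sketch. It is not the actual CPU or CUDA kernel: the function names `round_then_shift` and `shift_then_round` are hypothetical, the clamp to `[quant_min, quant_max]` is added only for illustration, and the default floating-point rounding mode (round half to even) is assumed.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical helper, not a PyTorch API.
// Current form: round the scaled value first, then add the zero point.
inline int64_t round_then_shift(float src, float inv_scale, int64_t zero_point,
                                int64_t quant_min, int64_t quant_max) {
  int64_t dst =
      static_cast<int64_t>(std::nearbyint(src * inv_scale)) + zero_point;
  // Clamp to the quantized range (illustrative assumption).
  return std::min(std::max(dst, quant_min), quant_max);
}

// Hypothetical helper, not a PyTorch API.
// Legacy form: add the zero point first, then round the sum.
inline int64_t shift_then_round(float src, float inv_scale, int64_t zero_point,
                                int64_t quant_min, int64_t quant_max) {
  int64_t dst =
      static_cast<int64_t>(std::nearbyint(src * inv_scale + zero_point));
  // Clamp to the quantized range (illustrative assumption).
  return std::min(std::max(dst, quant_min), quant_max);
}
```

With `src = 0.5f`, `inv_scale = 1.0f`, and `zero_point = 1`, the first form rounds 0.5 to 0 and yields 1, while the legacy form rounds 1.5 to 2. The two can therefore differ by one quantization level whenever `src * inv_scale` lands exactly on a tie and the zero point is odd.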
Test Plan: CI
Reviewed By: dskhudia
Differential Revision: D25694235
fbshipit-source-id: 0a615e559132aafe18543deac1ea5028dd840cb9