Fix consistency of histc on CPU and CUDA (#87832)
Fixes #87657
The main reason why `histc` returns slightly different outputs on CPU and CUDA is a difference in how the bin position is calculated.
The CPU calculates it as: https://github.com/pytorch/pytorch/blob/449778a939f2adc8867c5035b08be4e2d88339d8/aten/src/ATen/native/cpu/HistogramKernel.cpp#L168-L170
which is basically `(i - a) / (b - a) * N`, while the CUDA code calculates it as: https://github.com/pytorch/pytorch/blob/449778a939f2adc8867c5035b08be4e2d88339d8/aten/src/ATen/native/cuda/SummaryOps.cu#L41
which is `(i - a) * N / (b - a)`.
In some cases, such as the one in #87657, the order of the arithmetic operations matters because of floating-point round-off.
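A minimal sketch of the effect in plain Python, not taken from the kernels or from #87657. It assumes `i` is the input value, `a` and `b` are the range bounds, and `N` is the number of bins; the helper names and the values are hypothetical, and it uses Python doubles rather than the kernels' own types. It only shows that the two orderings can land in different bins:

```python
import math


def bin_divide_first(i, a, b, n_bins):
    # Ordering (i - a) / (b - a) * N: divide, then scale by the bin count.
    return math.floor((i - a) / (b - a) * n_bins)


def bin_multiply_first(i, a, b, n_bins):
    # Ordering (i - a) * N / (b - a): scale by the bin count, then divide.
    return math.floor((i - a) * n_bins / (b - a))


# Illustrative values (hypothetical, chosen to expose the round-off).
a, b, n_bins = 0.0, 100.0, 100
i = 57.0

# 57 / 100 rounds to a double slightly below 0.57, so multiplying by 100
# gives a value just under 57, which floors to 56.
print(bin_divide_first(i, a, b, n_bins))    # 56

# 57 * 100 = 5700 is exact, and 5700 / 100 is exactly 57.
print(bin_multiply_first(i, a, b, n_bins))  # 57
```

Both expressions are equal in exact arithmetic; the mismatch appears only when the true result sits on a bin boundary and one ordering rounds just below it.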
________________
Not sure where the most appropriate place for the unit test would be. Hopefully `test_reductions::test_histc` will do.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87832
Approved by: https://github.com/soumith