Fix overflow in torch.remainder when dividend is very large (#37758)
Summary:
This will fix the GPU implementation in https://github.com/pytorch/pytorch/issues/37743 and https://github.com/pytorch/pytorch/issues/24861. Please also check my [comment](https://github.com/pytorch/pytorch/issues/37743#issuecomment-623285707).
The fixed `remainder_kernel` follows the similar implementation in numpy. See https://github.com/numpy/numpy/blob/79d7bc276afbe89c746e462d28d4bfbb4fc56148/numpy/core/src/npymath/npy_math_internal.h.src#L649-L658
I also slightly update the doc for `torch.remainder`, to make it similar to `torch.fmod`.
I'm not sure how to modify the Vec256 code of CPU remainder_kernel, so I just leave it there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37758
Differential Revision: D21388417
Pulled By: ngimel
fbshipit-source-id: 770ba5801cf34619b2b68b8b0cf95d8cfa52e6f6