Fix index_put when tensor length > int_max (#33753)
Summary:
This PR fixes https://github.com/pytorch/pytorch/issues/33345.
The logic of the original CUDA kernel looks correct. I changed most occurrences of `int` to `int64_t` to avoid the CUDA memory access issue when the tensor length exceeds `INT_MAX`. Removed the two `TORCH_CHECK`s and added a unit test.
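For reviewers, here is a minimal host-side sketch of why the index width matters. It is not the kernel code from this PR: `flat_offset_64` and `flat_offset_32` are hypothetical helpers, and the row-major layout and the sizes are assumptions chosen only to push the offset past `INT_MAX`.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not from the PR): flat offset of element (row, col)
// in a row-major buffer, computed and returned in 64-bit arithmetic.
constexpr int64_t flat_offset_64(int64_t row, int64_t stride, int64_t col) {
  return row * stride + col;
}

// Hypothetical helper (not from the PR): the same offset stored in a 32-bit
// index, as happens when an index variable is declared `int`.
// The narrowing conversion wraps modulo 2^32 (guaranteed since C++20,
// implementation-defined but universal on mainstream compilers before that).
inline int32_t flat_offset_32(int64_t row, int64_t stride, int64_t col) {
  return static_cast<int32_t>(row * stride + col);
}
```

With `row = 3`, `stride = 1 << 30`, and `col = 5`, the true offset is 3221225477, which is above `INT_MAX` (2147483647). The 32-bit version wraps to -1073741819, so a kernel indexing with it would read or write outside the buffer.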
cc csarofeen ngimel ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33753
Differential Revision: D20185005
Pulled By: ngimel
fbshipit-source-id: ef0abdc12ea680e10fe6b85266e2773c7a272f0d