Fix index truncation in argmin/argmax for large tensors (#33310)
Summary:
Fixes the `TensorIterator` parts of https://github.com/pytorch/pytorch/issues/32863 (the THC code path is still broken).
`TensorIterator::split` now keeps track of each sub-iterator's `view_offsets` into the full tensor range. With this, I can take the base offset for the reduced dimension and translate partial results from the sub-iter into the index range of the full tensor. The translation happens only once for each intermediate result, so the inner loops should still benefit from the performance of 32-bit indexing.
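The offset translation can be illustrated with a minimal pure-Python sketch. This is not the `TensorIterator` code: `local_argmin`, `split_argmin`, and `max_chunk` are hypothetical names, and a 1-D list stands in for the reduced dimension. Each chunk is reduced using small chunk-local indices, and the chunk's base offset is added once per partial result to recover the index in the full range.

```python
from typing import Optional, Sequence, Tuple


def local_argmin(chunk: Sequence[float]) -> Tuple[float, int]:
    """Return (min value, index within the chunk).

    The index is chunk-local, so it stays small no matter where the
    chunk sits in the full range (the analogue of 32-bit indexing).
    """
    best_idx = 0
    for i in range(1, len(chunk)):
        if chunk[i] < chunk[best_idx]:
            best_idx = i
    return chunk[best_idx], best_idx


def split_argmin(data: Sequence[float], max_chunk: int) -> int:
    """Argmin over `data`, reduced in chunks of at most `max_chunk` items.

    `base` plays the role of a view offset (hypothetical stand-in): it is
    added once per partial result, not once per element.
    """
    if not data:
        raise ValueError("argmin of an empty sequence")
    best_val: Optional[float] = None
    best_global = -1
    for base in range(0, len(data), max_chunk):
        val, local_idx = local_argmin(data[base:base + max_chunk])
        global_idx = base + local_idx  # translate into the full index range
        # Strict comparison keeps the first occurrence on ties.
        if best_val is None or val < best_val:
            best_val, best_global = val, global_idx
    return best_global
```

Without the `base + local_idx` step, the second chunk in `split_argmin([5.0, 3.0, 9.0, 1.0, 4.0, 1.0, 7.0], 3)` would report index 0 instead of 3, which is the same kind of wrong index this fix addresses.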
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33310
Differential Revision: D19906136
Pulled By: ngimel
fbshipit-source-id: 3372ee4b8d5b115a53be79aeafc52e80ff9c490b