Fix some bugs of argmin/argmax and min/max (#39212)
Summary:
Partial fix of: https://github.com/pytorch/pytorch/issues/39060
There are actually two bugs:
1. `TensorIterator::get_dim_to_split` asserts on a condition it shouldn't.
2. `min_kernel_impl` and `max_kernel_impl` set `out_scalar_t` incorrectly. `out_scalar_t` is used to compute offsets into the accumulation buffer, which only comes into play when the tensor is large enough.
Both bugs are exercised by `test_argminmax_large_axis_cuda`, but unfortunately this test does not run on CI.
This PR makes `test_argminmax_large_axis_cuda` green, but the test still does not run on CI. I suggest keeping https://github.com/pytorch/pytorch/issues/39060 open until we figure out a way to run it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39212
Differential Revision: D21834723
Pulled By: ngimel
fbshipit-source-id: e8272ac8552c3954ac486ba6e4129fedb545031e