The dimension being reduced should not be coalesced by TensorIterator (#47237)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37583#issuecomment-720172838
Also adds an overload of `<<` to make debugging easier.
This PR is tested by `test_reduction_split_cuda`, which was added in https://github.com/pytorch/pytorch/pull/37788.
To reproduce:
```python
import torch

# Dimension 1 has size 1, so the sum below reduces over a size-1 dimension.
a = torch.zeros(8, 1, 128, 1024, 1024)
a.cuda().sum(1)
```
Before (the reduced dimension is coalesced with the other dimensions):
```
TensorIterator @ 0x7ffd05b10ba0 {
ntensors() = 2
noutputs() = 1
shape() = [1073741824]
strides(*) = {
(0) = [4]
(1) = [4]
}
dtype(*) = {
(0) = Float
(1) = Float
}
is_reduction_ = 1
}
```
After (the reduced dimension is kept as its own dimension):
```
TensorIterator @ 0x7fffc9051010 {
ntensors() = 2
noutputs() = 1
shape() = [1, 1073741824]
strides(*) = {
(0) = [0, 4]
(1) = [536870912, 4]
}
dtype(*) = {
(0) = Float
(1) = Float
}
is_reduction_ = 1
}
```
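The difference between the two dumps can be modelled with a toy version of stride-based dimension coalescing. The sketch below is not the actual TensorIterator code: `coalesce`, `reduced`, and `protect_reduced` are hypothetical names, and the per-dimension shape and byte strides are reconstructed from the repro shape and the "After" dump, assuming the reduced dimension is ordered first.

```python
def coalesce(shape, strides, reduced, protect_reduced):
    """Toy model: merge adjacent dimensions, listed fastest-varying first.

    shape:           size of each dimension
    strides:         byte strides per operand, strides[operand][dim]
    reduced:         reduced[dim] is True if that dimension is being reduced
    protect_reduced: if True, never merge a reduced dim with a non-reduced one
    """
    shape = list(shape)
    strides = [list(s) for s in strides]
    reduced = list(reduced)

    def can_merge(a, b):
        if protect_reduced and reduced[a] != reduced[b]:
            return False
        if shape[a] == 1 or shape[b] == 1:
            return True  # a size-1 dim adds no elements, so it merges with anything
        # Otherwise dim b must start exactly where dim a ends, for every operand.
        return all(s[b] == s[a] * shape[a] for s in strides)

    last = 0
    for dim in range(1, len(shape)):
        if can_merge(last, dim):
            if shape[last] == 1:
                # The merged dim takes the strides of the dim that has elements.
                for s in strides:
                    s[last] = s[dim]
                reduced[last] = reduced[dim]
            shape[last] *= shape[dim]
        else:
            last += 1
            shape[last] = shape[dim]
            reduced[last] = reduced[dim]
            for s in strides:
                s[last] = s[dim]
    n = last + 1
    return shape[:n], [s[:n] for s in strides]


# Float tensor of shape (8, 1, 128, 1024, 1024), reduced over dim 1.
# Dimensions are listed with the reduced dim first (assumption taken from the "After" dump).
shape = [1, 1024, 1024, 128, 8]
out_strides = [0, 4, 4096, 4194304, 536870912]         # operand 0 (output)
in_strides = [536870912, 4, 4096, 4194304, 536870912]  # operand 1 (input)
reduced = [True, False, False, False, False]

print(coalesce(shape, [out_strides, in_strides], reduced, protect_reduced=False))
# ([1073741824], [[4], [4]])
print(coalesce(shape, [out_strides, in_strides], reduced, protect_reduced=True))
# ([1, 1073741824], [[0, 4], [536870912, 4]])
```

With `protect_reduced=False` the size-1 reduced dimension is absorbed into its neighbor and the iterator no longer carries a dimension with output stride 0, as in the "Before" dump. With `protect_reduced=True` the result has the same shape and strides as the "After" dump.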
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47237
Reviewed By: ejguan
Differential Revision: D24734763
Pulled By: ngimel
fbshipit-source-id: 02bb2b15694c68f96434f55033b63b6e5ff7085b