Fix logic errors when accumulating reductions in output (CUDA) (#16023)
Summary:
The correct logic is as follows:
* If there is an earlier split, we need to combine with its result.
* If there is *not* a later split, we need to project before saving into the output.
This should partially fix #15837. For example:
```
In [7]: a=torch.ones([1838860800], dtype=torch.float, device="cuda:1")
In [8]: a.mean()
Out[8]: tensor(1., device='cuda:1')
```
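The two rules in the summary can be sketched in plain Python. This is only a rough model, not the CUDA kernel code: `mean_in_splits`, `split_size`, and the `out` variable are hypothetical names, and the sketch assumes a mean whose running accumulator is a plain sum and whose projection divides by the element count.

```python
def mean_in_splits(values, split_size):
    """Hypothetical model of a mean reduction computed over several splits."""
    n = len(values)
    out = None  # stands in for the output buffer shared across splits
    for start in range(0, n, split_size):
        # Reduce this split on its own (unprojected partial sum).
        partial = sum(values[start:start + split_size])

        has_earlier_split = start > 0
        has_later_split = start + split_size < n

        # Rule 1: if there is an earlier split, combine with its result.
        if has_earlier_split:
            partial = out + partial

        # Rule 2: if there is no later split, project before saving.
        if not has_later_split:
            partial = partial / n

        out = partial
    return out


print(mean_in_splits([1.0] * 10, 3))
```

Under these assumptions the intermediate value saved in `out` stays an unprojected sum until the last split, so the division by `n` happens exactly once.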
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16023
Differential Revision: D13678449
Pulled By: umanwizard
fbshipit-source-id: ab5078484c88e96bb30121b5cf24a0e8b0a8c2f8