Remove incorrect stride assert in Reduce.cuh (#65227)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37583
Per discussion with ngimel, the condition asserted here may not always hold after TensorIterator's dimension coalescing and reordering. However, the reduction output should still be correct even when `sub_iter.strides(0)[0]` is non-zero.
I've verified correctness empirically by:
1. Lowering the threshold ([configured here](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/TensorIterator.cpp#L1127)) at which iterators are split into sub-iterators, making it easier to trigger.
2. Generating many tensors with random dimensions and random integer elements (via `randint`) that produce a non-zero `sub_iter.strides(0)[0]` in the CUDA kernel.
3. Verifying that the reduction `t.sum(dim=0)` produces the same results for those tensors on CPU and on CUDA.
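Steps 2 and 3 can be sketched roughly as below. This is a minimal illustration, not the exact script used for this PR: the function name `count_sum_mismatches`, the shape bounds, and the element range are all assumptions. On a stock build, tensors this small are not expected to be split into sub-iterators, so the sketch only exercises the relevant path when combined with the lowered threshold from step 1.

```python
import random

import torch


def count_sum_mismatches(trials=20, max_rank=4, max_dim=8, seed=0):
    """Count random tensors whose t.sum(dim=0) differs between CPU and the test device.

    Hypothetical helper for illustration; names and bounds are assumptions.
    """
    rng = random.Random(seed)
    torch.manual_seed(seed)
    # Fall back to CPU when CUDA is unavailable so the script still runs
    # (the comparison is then trivially CPU against CPU).
    device = "cuda" if torch.cuda.is_available() else "cpu"

    mismatches = 0
    for _ in range(trials):
        # Random rank and random size for each dimension.
        rank = rng.randint(1, max_rank)
        shape = [rng.randint(1, max_dim) for _ in range(rank)]

        # Integer elements allow an exact comparison: integer addition is
        # associative, so reduction order cannot change the result.
        t = torch.randint(-100, 100, shape, dtype=torch.int64)

        expected = t.sum(dim=0)
        actual = t.to(device).sum(dim=0).cpu()
        if not torch.equal(expected, actual):
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print("mismatches:", count_sum_mismatches())
```

Integer inputs are used so that `torch.equal` is a valid check; with floating-point inputs a tolerance-based comparison such as `torch.allclose` would be needed instead.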
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65227
Reviewed By: ngimel
Differential Revision: D31031406
Pulled By: saketh-are
fbshipit-source-id: 5cbf2001224454c74f6db42455c507365ad1f2b1