Fix prelu_backward TensorIterator split (#36134)
Summary:
We should have
```C++
for (auto& sub_iter : iter.with_32bit_indexing()) {
launch_prelu_cuda_backward_share_weights_kernel(sub_iter, weight_data);
}
```
But I mistakenly wrote it as
```C++
for (auto& sub_iter : iter.with_32bit_indexing()) {
launch_prelu_cuda_backward_share_weights_kernel(iter, weight_data);
}
```
in my previous PR. Passing the full `iter` instead of `sub_iter` leads to infinite recursion: the full iterator never passes the 32-bit indexing check, so the launcher keeps splitting and calling itself forever.
I found this bug when working on https://github.com/pytorch/pytorch/pull/34004
I also added a `TORCH_INTERNAL_ASSERT_DEBUG_ONLY` to guard against this.
Besides, the caller already guarantees contiguous inputs, so we don't need to handle non-contiguous tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36134
Differential Revision: D21187542
Pulled By: VitalyFedyunin
fbshipit-source-id: 0fafdd7b672bf89fcaa2b42e08b7d41ade7e6bcb