Split on batch dimension when 32-bit indexing is not enough for convolution forward (#31379)
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/22496
This is just a first step towards supporting 64-bit convolution on CUDA. In the forward pass of convolution, if the total tensor size is larger than 2^31, we split the input on the batch dimension. I want to get some review feedback before moving forward with the same splitting approach for the backward pass.
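As a rough illustration of the batch-splitting idea (the actual change lives in the C++ cuDNN convolution bindings; `conv2d_split_on_batch` and the chunk-size computation below are hypothetical names for illustration, not the real implementation):

```python
import torch
import torch.nn.functional as F

INT_MAX = 2 ** 31  # cuDNN indexing is limited to 32 bits

def conv2d_split_on_batch(input, weight, **kwargs):
    # Small tensors: run the cuDNN-backed convolution directly.
    if input.numel() < INT_MAX:
        return F.conv2d(input, weight, **kwargs)
    # Large tensors: split along the batch dimension into chunks whose
    # element count stays under 2^31, convolve each chunk, and concatenate.
    n = input.size(0)
    elements_per_sample = input.numel() // n
    samples_per_chunk = max(1, (INT_MAX - 1) // elements_per_sample)
    outputs = [
        F.conv2d(chunk, weight, **kwargs)
        for chunk in input.split(samples_per_chunk, dim=0)
    ]
    return torch.cat(outputs, dim=0)
```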
There are real-world use cases where, even when N=1, the input is still larger than 2^31. Splitting would be complicated in that case, so I am planning to modify `use_cudnn` to dispatch directly to the slow fallback kernel in PyTorch in a later PR. A sketch of the dispatch condition follows the update below.
Update: `later PR` is https://github.com/pytorch/pytorch/pull/31383
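A hedged sketch of that dispatch check (the helper name and the Python-level formulation are illustrative only; the real decision happens in the C++ `use_cudnn` logic):

```python
INT_MAX = 2 ** 31

def can_split_on_batch(input):
    # Batch splitting only helps when a single sample already fits under the
    # 32-bit indexing limit; otherwise dispatch to the slow fallback kernel.
    elements_per_sample = input.numel() // max(input.size(0), 1)
    return elements_per_sample < INT_MAX
```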
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31379
Differential Revision: D19192018
Pulled By: ngimel
fbshipit-source-id: c26ecc56319ac67c4d5302ffed246b8d9b5eb972