Remove the construction of unused tensors (#79183)
The statement [`auto columns = at::empty({nInputPlane * kW * kH, outputHeight * outputWidth}, input.options());`](https://github.com/pytorch/pytorch/blob/95b15c266baaf989ef7b6bbd7c23a2d90bacf687/aten/src/ATen/native/cuda/ConvolutionMM2d.cu#L154) constructs a new tensor and allocates device memory for it. However, this tensor is only used ([line 185](https://github.com/pytorch/pytorch/blob/95b15c266baaf989ef7b6bbd7c23a2d90bacf687/aten/src/ATen/native/cuda/ConvolutionMM2d.cu#L185) and [line 197](https://github.com/pytorch/pytorch/blob/95b15c266baaf989ef7b6bbd7c23a2d90bacf687/aten/src/ATen/native/cuda/ConvolutionMM2d.cu#L197)) when [`requires_columns`](https://github.com/pytorch/pytorch/blob/95b15c266baaf989ef7b6bbd7c23a2d90bacf687/aten/src/ATen/native/cuda/ConvolutionMM2d.cu#L156) is true.
Instead, we can declare an `at::Tensor columns;` variable (which allocates no device memory) and invoke `at::empty` to construct the tensor only when `requires_columns` is true. As for the statement [`int64_t n = columns.size(1);`](https://github.com/pytorch/pytorch/blob/95b15c266baaf989ef7b6bbd7c23a2d90bacf687/aten/src/ATen/native/cuda/ConvolutionMM2d.cu#L192), that size can be computed directly from the arguments of `slow_conv2d_forward`, so it does not require `columns` to exist.
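The pattern can be sketched as follows. This is not the actual ATen code: `FakeTensor`, `fake_empty`, and `forward_sketch` are hypothetical stand-ins used only to illustrate deferring the allocation behind the `requires_columns` check and deriving `n` from the shape arguments instead of `columns.size(1)`.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for at::Tensor that records whether backing
// memory was ever allocated; used only to illustrate the pattern.
struct FakeTensor {
  std::vector<int64_t> sizes;
  bool allocated = false;
};

// Stand-in for at::empty: materializes storage for the given sizes.
FakeTensor fake_empty(std::vector<int64_t> sizes) {
  return FakeTensor{std::move(sizes), /*allocated=*/true};
}

// Sketch of the changed control flow: `columns` is default-constructed
// (no allocation) and only materialized when `requires_columns` is true.
// Returns the value formerly taken from columns.size(1), now computed
// from the function's own arguments.
int64_t forward_sketch(int64_t nInputPlane, int64_t kW, int64_t kH,
                       int64_t outputHeight, int64_t outputWidth,
                       bool requires_columns, bool* did_allocate) {
  FakeTensor columns;  // declaration only: nothing allocated yet
  if (requires_columns) {
    columns = fake_empty({nInputPlane * kW * kH, outputHeight * outputWidth});
  }
  *did_allocate = columns.allocated;
  // int64_t n = columns.size(1);  // old form: forces columns to exist
  return outputHeight * outputWidth;  // same value, no tensor needed
}
```

When `requires_columns` is false, no storage is ever touched, which is exactly the memory the original code wasted.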
Profiling resnet50 from [`pytorch/benchmark`](https://github.com/pytorch/benchmark) showed many unused tensors resident in device memory, and they are gone after this optimization. The same improvement holds for vgg16, yolov3, alexnet, and other models.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79183
Approved by: https://github.com/ngimel