slow_conv2d grad_input: avoid dispatch in parallel region (#65725)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65725
See gh-56794
Avoid dispatch inside of `parallel_for` by:
1. Replacing Tensor slicing with `TensorAccessor`
2. Calling `grad_input.zero_()` only once, outside of the parallel region
3. Replacing `at::mm` with a direct `gemm` call
Test Plan: Imported from OSS
Reviewed By: saketh-are
Differential Revision: D31257876
Pulled By: ngimel
fbshipit-source-id: f2902edeccd161431c1dfb1ab3e165d039ec259d