slow_conv2d_forward: avoid calling dispatcher in parallel region (#65724)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65724
See gh-56794
Avoid dispatch inside of parallel_for by:
1. Replacing Tensor slicing with TensorAccessor
2. Copying bias into the output only once, outside of the parallel region
3. Replacing `addmm_` with a direct call to gemm (see the sketch after this list).
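To make the intent concrete, here is a minimal, hypothetical sketch of the resulting pattern, not the actual kernel from this PR: the `forward_pattern_sketch` name, the tensor shapes, and the float-only accessors are assumptions for illustration, and the real code is templated over scalar types and operates on the unfolded input columns.

```cpp
#include <ATen/ATen.h>
#include <ATen/Parallel.h>
#include <c10/util/irange.h>

// Sketch only: batched forward where bias handling and per-sample access
// avoid calling the dispatcher inside the parallel region.
void forward_pattern_sketch(
    at::Tensor& output,        // [N, C_out, H_out * W_out], assumed contiguous
    const at::Tensor& input,   // [N, C_in, K], assumed contiguous (stands in for the unfolded input)
    const at::Tensor& bias) {  // [C_out], may be undefined
  // (2) Copy the bias into the output once, before the parallel region;
  //     this single copy_ still dispatches, but only one time per call.
  if (bias.defined()) {
    output.copy_(bias.reshape({1, bias.size(0), 1}).expand_as(output));
  } else {
    output.zero_();
  }

  // (1) Raw accessors instead of Tensor slicing: indexing an accessor is
  //     plain pointer arithmetic, so nothing below goes through the dispatcher.
  auto output_a = output.accessor<float, 3>();
  auto input_a = input.accessor<float, 3>();

  at::parallel_for(0, output.size(0), 0, [&](int64_t begin, int64_t end) {
    for (const auto n : c10::irange(begin, end)) {
      float* out_ptr = output_a[n].data();      // per-sample output block
      const float* in_ptr = input_a[n].data();  // per-sample input block
      // (3) A direct gemm on out_ptr/in_ptr (e.g. through the cpublas wrappers)
      //     replaces the former addmm_ call; its arguments are elided here.
      (void)out_ptr;
      (void)in_ptr;
    }
  });
}
```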
Technically this also adds a new requirement that the output always be
contiguous, but the out-argument version isn't exposed or used anywhere
in the `torch.nn` API, so that should be fine.
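For reference, a hedged sketch of how that precondition could be asserted at the top of the kernel; the helper name is assumed, and the PR may enforce the layout differently (e.g. by resizing the output to a contiguous buffer):

```cpp
#include <ATen/ATen.h>
#include <c10/util/Exception.h>

// Assumed helper name; writing through raw accessors and gemm requires a
// dense, contiguous output buffer.
static void check_contiguous_output(const at::Tensor& output) {
  TORCH_CHECK(output.is_contiguous(),
              "slow_conv2d_forward: expected a contiguous output tensor");
}
```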
Test Plan: Imported from OSS
Reviewed By: saketh-are
Differential Revision: D31257875
Pulled By: ngimel
fbshipit-source-id: 84d2b39e7f65334bdfcc2c4719f93ee3c514ca32