slow_conv3d: Avoid dispatch in parallel region (#65737)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65737
See gh-56794
Avoid dispatcher calls inside of parallel_for by:
- Replacing Tensor slicing with TensorAccessor
- Copying bias into output only once, outside of the parallel region
- Replacing `addmm_` and `baddbmm_` with direct calls to gemm
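
The pattern can be illustrated with a minimal, standalone sketch. This is not the code from the PR: `naive_gemm` and `conv_as_gemm` are hypothetical names, the tensor layouts are assumed, and a plain loop stands in for the body that would run under parallel_for. It shows the bias being broadcast into the output once up front, followed by per-batch work that uses only raw pointer offsets and a gemm that accumulates into the output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a BLAS gemm call (row-major):
// C = A * B + beta * C, where A is m x k, B is k x n, C is m x n.
void naive_gemm(int64_t m, int64_t n, int64_t k,
                const float* a, const float* b, float beta, float* c) {
  for (int64_t i = 0; i < m; ++i) {
    for (int64_t j = 0; j < n; ++j) {
      float acc = 0.f;
      for (int64_t p = 0; p < k; ++p) {
        acc += a[i * k + p] * b[p * n + j];
      }
      c[i * n + j] = acc + beta * c[i * n + j];
    }
  }
}

// Assumed layouts (illustrative only):
//   weight:  [out_ch, k]
//   columns: [batch, k, spatial]   (unfolded input)
//   bias:    [out_ch]
//   output:  [batch, out_ch, spatial]
void conv_as_gemm(int64_t batch, int64_t out_ch, int64_t k, int64_t spatial,
                  const std::vector<float>& weight,
                  const std::vector<float>& columns,
                  const std::vector<float>& bias,
                  std::vector<float>& output) {
  assert(weight.size() == static_cast<std::size_t>(out_ch * k));
  assert(columns.size() == static_cast<std::size_t>(batch * k * spatial));
  assert(bias.size() == static_cast<std::size_t>(out_ch));
  assert(output.size() == static_cast<std::size_t>(batch * out_ch * spatial));

  // Step 1: broadcast bias into the output once, before the per-batch loop.
  for (int64_t b = 0; b < batch; ++b) {
    for (int64_t c = 0; c < out_ch; ++c) {
      for (int64_t s = 0; s < spatial; ++s) {
        output[(b * out_ch + c) * spatial + s] = bias[c];
      }
    }
  }

  // Step 2: per-batch work. In a threaded version this loop would be the
  // parallel region; it only computes pointer offsets and calls gemm with
  // beta = 1, so the product is added on top of the bias already in place.
  for (int64_t b = 0; b < batch; ++b) {
    const float* col = columns.data() + b * k * spatial;
    float* out = output.data() + b * out_ch * spatial;
    naive_gemm(out_ch, spatial, k, weight.data(), col, 1.0f, out);
  }
}
```

Because the loop body in step 2 touches only raw pointers and disjoint output slices, each batch index can be processed independently of the others.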
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D31257874
Pulled By: ngimel
fbshipit-source-id: 20b94daa13082fb1e39eaa8144bfa4c611b61bab