slow_conv3d grad_input: Avoid dispatch in parallel region (#65757)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65757
See gh-56794
Avoid dispatch inside of parallel_for by:
- Replacing Tensor slicing with TensorAccessor
- Replacing `bmm` and `mm` with direct calls to gemm.
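
As a rough illustration of the pattern described above (not the code from this PR), the sketch below indexes batches through a lightweight accessor and calls a gemm routine directly on raw pointers, so the per-batch loop body performs no Tensor operations. `BatchAccessor`, `naive_gemm`, `matmul_chunk`, and `batched_matmul` are hypothetical names introduced for this example; they are stand-ins, not ATen APIs, and the serial loop stands in for the chunk that one worker would receive inside `parallel_for`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a tensor accessor: a raw pointer plus the
// stride between batch entries. Indexing is pointer arithmetic only.
template <typename T>
struct BatchAccessor {
  T* data;
  int64_t batch_stride;
  T* operator[](int64_t b) const { return data + b * batch_stride; }
};

// Naive row-major gemm: C (m x n) = A (m x k) * B (k x n).
// Hypothetical stand-in for a direct BLAS-style gemm call.
void naive_gemm(int64_t m, int64_t n, int64_t k,
                const float* a, const float* b, float* c) {
  for (int64_t i = 0; i < m; ++i) {
    for (int64_t j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (int64_t p = 0; p < k; ++p) {
        acc += a[i * k + p] * b[p * n + j];
      }
      c[i * n + j] = acc;
    }
  }
}

// Process batches [begin, end). This mirrors the shape of a parallel loop
// body: only accessor indexing and a direct gemm call per batch.
void matmul_chunk(int64_t begin, int64_t end,
                  int64_t m, int64_t k, int64_t n,
                  BatchAccessor<const float> a,
                  BatchAccessor<const float> b,
                  BatchAccessor<float> c) {
  for (int64_t i = begin; i < end; ++i) {
    naive_gemm(m, n, k, a[i], b[i], c[i]);
  }
}

// Convenience wrapper: a is (batch x m x k), b is (batch x k x n),
// both contiguous and row-major. Runs the whole range as one chunk.
std::vector<float> batched_matmul(const std::vector<float>& a,
                                  const std::vector<float>& b,
                                  int64_t batch, int64_t m,
                                  int64_t k, int64_t n) {
  assert(static_cast<int64_t>(a.size()) == batch * m * k);
  assert(static_cast<int64_t>(b.size()) == batch * k * n);
  std::vector<float> c(static_cast<std::size_t>(batch * m * n));
  BatchAccessor<const float> a_acc{a.data(), m * k};
  BatchAccessor<const float> b_acc{b.data(), k * n};
  BatchAccessor<float> c_acc{c.data(), m * n};
  matmul_chunk(0, batch, m, k, n, a_acc, b_acc, c_acc);
  return c;
}
```

Because each chunk writes to a disjoint range of output batches, chunks like this can run on separate threads without synchronization.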
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D31257878
Pulled By: ngimel
fbshipit-source-id: e6aad2d5ae7fa432bd27af2b1a8b0dcef1fc6653