Optimize grouped Conv3d performance (#36355)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36355
Resolves https://github.com/pytorch/pytorch/issues/36155 by:
- supporting grouped conv3d in ```slow_conv3d```
- adding a fast path in ```__convolution``` to call ```slow_conv3d``` when
running grouped conv3d on CPU
- bypassing unfolding when kernel_size = 1
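
Why the last item is safe can be sketched in plain Python. This is an illustration only, not the ATen code: the helpers ```vol2col```, ```flatten``` and ```matmul``` are hypothetical names, and the sketch assumes stride 1, no padding, no dilation and a single group. Under those assumptions, unfolding with a 1x1x1 kernel reproduces the input viewed as a (C, D*H*W) matrix, so the copy can be skipped and the weight multiplied against the input directly.

```python
from itertools import product


def vol2col(x, k):
    """Unfold a (C, D, H, W) nested list for a k*k*k kernel.

    Hypothetical helper: stride 1, no padding, no dilation.
    Returns C*k^3 rows, each holding one value per output position.
    """
    C, D, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    od, oh, ow = D - k + 1, H - k + 1, W - k + 1
    rows = []
    for c, kd, kh, kw in product(range(C), range(k), range(k), range(k)):
        rows.append([x[c][d + kd][h + kh][w + kw]
                     for d, h, w in product(range(od), range(oh), range(ow))])
    return rows


def flatten(x):
    """View (C, D, H, W) as a (C, D*H*W) matrix without rearranging values."""
    return [[v for plane in ch for row in plane for v in row] for ch in x]


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)]
            for row in a]


# Toy input: 2 channels, 2x2x2 volume, values 0..15.
x = [[[[c * 8 + d * 4 + h * 2 + w for w in range(2)]
       for h in range(2)]
      for d in range(2)]
     for c in range(2)]

# 1x1x1 kernel weights as an (out_channels, in_channels) matrix.
weight = [[1, 2], [3, -1]]

# General path: unfold, then multiply.
out_unfolded = matmul(weight, vol2col(x, 1))
# Bypass path: multiply against the flattened input directly.
out_direct = matmul(weight, flatten(x))
```

For any larger kernel the unfolded matrix has k^3 times as many rows as the input has channels, which is the extra copy the bypass avoids in the 1x1x1 case.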
Test Plan:
Added the following test cases to test_nn.py, covering both the forward and
backward passes:
- test_Conv3d_groups_nobias
- test_Conv3d_groups_wbias
- test_Conv_1x1
Imported from OSS
Differential Revision: D20957073
fbshipit-source-id: 29afd1e6be8c484859eaedd51463954e2fdccc38