[pytorch] Fix mkldnn heuristic for multithreaded convolution (#52909)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52909
PR #46675 introduced heuristics to use thnn_conv2d for 1x1
convolutions, since mkldnn had a bug that was slowing those cases
down. Unfortunately, the test plan for that PR only covered single-threaded
convolutions; with multiple threads, mkldnn is considerably faster than thnn_conv2d.
An example from yolov3, on 24 cores of a Xeon Platinum 8175M CPU @ 2.50GHz:
```
input:{1, 64, 192, 256}, weight:{32, 64, 1, 1}
thnn_conv2d: GFLOPS/s=104.574G/s
mkldnn_convolution: GFLOPS/s=467.357G/s
```
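For scale, the numbers above can be turned into a speedup and a rough per-call time. This is a back-of-the-envelope sketch, not part of the benchmark: it assumes NCHW layout, stride 1, no padding, and 2 FLOPs per multiply-add (the counting convention used by the benchmark is not stated here). The helper name `conv2d_flops` is hypothetical.

```python
def conv2d_flops(n, c_in, h_out, w_out, c_out, k_h, k_w):
    """FLOPs for a dense 2D convolution, counting one multiply-add as 2 FLOPs."""
    macs = n * c_out * h_out * w_out * c_in * k_h * k_w
    return 2 * macs


# Shapes from the example above: input {1, 64, 192, 256}, weight {32, 64, 1, 1}.
# Assumption: stride 1 and no padding, so the 1x1 convolution keeps 192x256.
flops = conv2d_flops(n=1, c_in=64, h_out=192, w_out=256, c_out=32, k_h=1, k_w=1)

# Throughput figures copied from the measurements above (GFLOP/s).
thnn_gflops = 104.574
mkldnn_gflops = 467.357

# The speedup does not depend on the FLOP-counting convention.
speedup = mkldnn_gflops / thnn_gflops

# Implied time per call in microseconds, under the assumptions stated above.
thnn_us = flops / (thnn_gflops * 1e9) * 1e6
mkldnn_us = flops / (mkldnn_gflops * 1e9) * 1e6

print(f"FLOPs per call: {flops}")
print(f"mkldnn speedup over thnn_conv2d: {speedup:.2f}x")
print(f"implied time per call: thnn={thnn_us:.0f}us, mkldnn={mkldnn_us:.0f}us")
```

Under these assumptions mkldnn is roughly 4.5x faster than thnn_conv2d for this shape on 24 cores.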
ghstack-source-id: 122627564
Test Plan: Multithreaded 1x1 convolutions
Reviewed By: wconstab, xuzhao9
Differential Revision: D26685272
fbshipit-source-id: e8e05db89e43856969e26570a170c13b3e73ac74