Use parallel_for in DepthwiseConvKernel (#26879)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26879
Integrate DepthwiseConvKernel with the at::parallel_for API so that the depthwise convolution can run multi-threaded on mobile.
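For readers unfamiliar with the pattern, here is a minimal sketch of how a depthwise kernel can be split across threads. `at::parallel_for(begin, end, grain_size, f)` calls `f(sub_begin, sub_end)` on sub-ranges of `[begin, end)`. To keep the example free of ATen dependencies, `parallel_for_sketch` below is a hypothetical stand-in with the same call shape, built on `std::thread`; it is not ATen's implementation. `depthwise3x3_sketch` is also hypothetical: it uses a direct 3x3 convolution instead of Winograd, and splitting the work across channels is an assumption, not necessarily how this PR partitions it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in mirroring the call shape of at::parallel_for:
// f(sub_begin, sub_end) is invoked on disjoint sub-ranges of [begin, end).
template <typename F>
void parallel_for_sketch(int64_t begin, int64_t end, int64_t grain_size, const F& f) {
  assert(grain_size > 0);
  if (begin >= end) return;
  const int64_t range = end - begin;
  const int64_t hw = std::max<int64_t>(1, std::thread::hardware_concurrency());
  // Use at most range / grain_size chunks so small workloads stay on one thread.
  const int64_t max_chunks = std::max<int64_t>(1, range / grain_size);
  const int64_t num_chunks = std::min(hw, max_chunks);
  const int64_t chunk = (range + num_chunks - 1) / num_chunks;
  std::vector<std::thread> workers;
  for (int64_t start = begin + chunk; start < end; start += chunk) {
    workers.emplace_back([&f, start, end, chunk] {
      f(start, std::min(end, start + chunk));
    });
  }
  f(begin, std::min(end, begin + chunk));  // first chunk runs on the calling thread
  for (auto& w : workers) w.join();
}

// Direct depthwise 3x3 convolution (stride 1, zero padding 1) on a CHW tensor.
// Each channel writes a disjoint slice of `output`, so no locking is needed.
void depthwise3x3_sketch(const std::vector<float>& input,   // [C, H, W]
                         const std::vector<float>& kernel,  // [C, 3, 3]
                         std::vector<float>& output,        // [C, H, W]
                         int64_t C, int64_t H, int64_t W) {
  parallel_for_sketch(0, C, /*grain_size=*/1, [&](int64_t c_begin, int64_t c_end) {
    for (int64_t c = c_begin; c < c_end; ++c) {
      for (int64_t y = 0; y < H; ++y) {
        for (int64_t x = 0; x < W; ++x) {
          float acc = 0.f;
          for (int64_t ky = 0; ky < 3; ++ky) {
            for (int64_t kx = 0; kx < 3; ++kx) {
              const int64_t iy = y + ky - 1;
              const int64_t ix = x + kx - 1;
              if (iy < 0 || iy >= H || ix < 0 || ix >= W) continue;  // zero padding
              acc += input[(c * H + iy) * W + ix] * kernel[(c * 3 + ky) * 3 + kx];
            }
          }
          output[(c * H + y) * W + x] = acc;
        }
      }
    }
  });
}
```

Because the per-channel work is independent, the serial loop over channels maps directly onto the `[c_begin, c_end)` sub-range handed to the callback, which is what makes this kind of kernel a natural fit for a range-based parallel loop.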
Test Plan:
- Verified numerical results are the same as before.
- Benchmarked depthwise3x3_winograd layers in MobileNetV2 on two devices (S9 and OP5), single-threaded vs. multi-threaded:
```
+-------------------+----------------+--------+-----------+----------+------------+-----------+
| Input             | Kernel         | Groups | S9 Single | S9 Multi | OP5 Single | OP5 Multi |
+-------------------+----------------+--------+-----------+----------+------------+-----------+
| [1, 32, 112, 112] | [32, 1, 3, 3]  |     32 |      6796 |     1676 |       8520 |      5361 |
| [1, 144, 56, 56]  | [144, 1, 3, 3] |    144 |      8004 |     5523 |       9591 |      4157 |
| [1, 192, 28, 28]  | [192, 1, 3, 3] |    192 |      2771 |      730 |       3345 |      1436 |
| [1, 192, 28, 28]  | [192, 1, 3, 3] |    192 |      2688 |      730 |       3358 |      1979 |
| [1, 384, 14, 14]  | [384, 1, 3, 3] |    384 |      1641 |      461 |       1895 |       874 |
| [1, 384, 14, 14]  | [384, 1, 3, 3] |    384 |      1765 |      444 |       1914 |       870 |
| [1, 384, 14, 14]  | [384, 1, 3, 3] |    384 |      1636 |      448 |       1896 |       852 |
| [1, 384, 14, 14]  | [384, 1, 3, 3] |    384 |      1639 |      452 |       1964 |      1010 |
| [1, 576, 14, 14]  | [576, 1, 3, 3] |    576 |      2575 |      677 |       2854 |      1274 |
| [1, 576, 14, 14]  | [576, 1, 3, 3] |    576 |      2595 |      749 |       2836 |      1291 |
| [1, 960, 7, 7]    | [960, 1, 3, 3] |    960 |      1586 |      432 |       1714 |       675 |
| [1, 960, 7, 7]    | [960, 1, 3, 3] |    960 |      1552 |      421 |       1690 |      1770 |
| [1, 960, 7, 7]    | [960, 1, 3, 3] |    960 |      1680 |      424 |       1690 |       837 |
+-------------------+----------------+--------+-----------+----------+------------+-----------+
| TOTAL                                       |     36928 |    13167 |      43267 |     22386 |
+-------------------+----------------+--------+-----------+----------+------------+-----------+
```
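As a quick reading aid, the TOTAL row can be turned into an overall speedup per device. The sketch below assumes the table values are latencies where lower is better (the unit is not stated in this message); `speedup` and `print_speedups` are hypothetical helper names. Under that assumption the totals work out to roughly 2.8x on S9 and 1.9x on OP5.

```cpp
#include <cassert>
#include <cstdio>

// Ratio of the single-threaded total to the multi-threaded total.
// Assumes both values are latencies in the same (unstated) unit.
double speedup(double single_total, double multi_total) {
  assert(multi_total > 0);
  return single_total / multi_total;
}

// Prints the overall speedup for each device, using the TOTAL row of the table.
void print_speedups() {
  std::printf("S9:  %.2fx\n", speedup(36928, 13167));
  std::printf("OP5: %.2fx\n", speedup(43267, 22386));
}
```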
Differential Revision: D17598249
Pulled By: ljk53
fbshipit-source-id: aaeea221494f11b153a35af2b818a603f1f32ddf