Use Caffe2's implementation of grouped depthwise 3x3 convolutions (#26556)
Summary:
Use Caffe2's implementation of grouped depthwise 3x3 convolutions instead of NNPACK.
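For context, a depthwise 3x3 convolution is a grouped convolution in which the number of groups equals the number of channels, so each channel is filtered independently by its own 3x3 kernel; that is the shape the specialized kernel targets. A minimal pure-Python sketch of the operation (the helper name `depthwise_conv3x3` is illustrative only, not the Caffe2 kernel's API):

```python
def depthwise_conv3x3(inp, weight):
    # inp: [C][H][W], weight: [C][3][3] -- one filter per channel
    # (groups == C), stride 1, no padding.
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    out = [[[0.0] * (W - 2) for _ in range(H - 2)] for _ in range(C)]
    for c in range(C):                      # each channel is independent
        for y in range(H - 2):
            for x in range(W - 2):
                acc = 0.0
                for ky in range(3):         # 3x3 window
                    for kx in range(3):
                        acc += inp[c][y + ky][x + kx] * weight[c][ky][kx]
                out[c][y][x] = acc
    return out

# An identity kernel per channel reproduces the input's interior.
inp = [[[float(y * 4 + x) for x in range(4)] for y in range(4)]]
ident = [[[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]]
out = depthwise_conv3x3(inp, ident)
```

Because channels never mix, the per-channel inner loops vectorize and parallelize well, which is what makes a dedicated fast path worthwhile compared with the general grouped-convolution route.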
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26556
Test Plan:
_Correctness_ - Verified manually by comparing outputs via the --print-output flag on speed_benchmark_torch.
_Performance_ - All measurements below were taken on a Pixel 2.
**Before**:
Multi-threaded:
> adb shell "./speed_benchmark_torch \
> --model=./xraymobilev3.pt \
> --input_dims="1,3,224,224" \
> --input_type=float --warmup=5 \
> --iter=25"
>
> Main run finished. Milliseconds per iter: **876.002**. Iters per second: 1.14155
Single-threaded:
> adb shell "./speed_benchmark_torch \
> --model=./xraymobilev3.pt \
> --input_dims="1,3,224,224" \
> --input_type=float --warmup=5 \
> --iter=25 \
> --caffe2_threadpool_force_inline=true"
>
> Main run finished. Milliseconds per iter: **459.409**. Iters per second: 2.17671
**After**:
Multi-threaded:
> adb shell "./speed_benchmark_torch \
> --model=./xraymobilev3.pt \
> --input_dims="1,3,224,224" \
> --input_type=float --warmup=5 \
> --iter=25"
>
> Main run finished. Milliseconds per iter: **285.68**. Iters per second: 3.50042
Single-threaded:
> adb shell "./speed_benchmark_torch \
> --model=./xraymobilev3.pt \
> --input_dims="1,3,224,224" \
> --input_type=float --warmup=5 \
> --iter=25 \
> --caffe2_threadpool_force_inline=true"
>
> Main run finished. Milliseconds per iter: **278.999**. Iters per second: 3.58425
Differential Revision: D17533311
Pulled By: AshkanAliabadi
fbshipit-source-id: 9ee8acf02b8e3e8da1922b188ed0a6459a90b67d