Add benchmark for per channel tensor quantization (#46017)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46017
Currently, only per-tensor quantization is optimized on mobile using ARM intrinsics. This benchmark is
added to help gauge the performance improvement on mobile after applying the same optimizations to per-channel quantization.
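As a rough illustration, the benchmark exercises a call like `at::quantize_per_channel`. The sketch below is not the exact code in this PR; the Google Benchmark harness, tensor shapes, channel counts, and the `BM_QuantizePerChannel` name are illustrative assumptions.
```
// Sketch of a per-channel quantization benchmark (assumed shapes/sizes).
#include <benchmark/benchmark.h>
#include <ATen/ATen.h>

static void BM_QuantizePerChannel(benchmark::State& state) {
  const int64_t channels = state.range(0);
  // Float input quantized along the channel axis (axis = 1).
  auto input = at::rand({1, channels, 56, 56});
  auto scales = at::rand({channels}, at::kDouble) * 0.1 + 0.01;
  auto zero_points = at::zeros({channels}, at::kLong);
  for (auto _ : state) {
    auto q = at::quantize_per_channel(
        input, scales, zero_points, /*axis=*/1, at::kQUInt8);
    benchmark::DoNotOptimize(q);
  }
}
BENCHMARK(BM_QuantizePerChannel)->Arg(32)->Arg(64);
BENCHMARK_MAIN();
```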
Test Plan:
Build for ARM Neon
```
BUILD_MOBILE_BENCHMARK=1 BUILD_MOBILE_TEST=1 ANDROID_DEBUG_SYMBOLS=1 BUILD_PYTORCH_MOBILE=1 ANDROID_ABI="armeabi-v7a with NEON" ./scripts/build_android.sh -DANDROID_CCACHE=$(which ccache) -DBUILD_BINARY=ON
```
Build for ARM64
```
BUILD_MOBILE_BENCHMARK=1 BUILD_MOBILE_TEST=1 ANDROID_DEBUG_SYMBOLS=1 BUILD_PYTORCH_MOBILE=1 ANDROID_ABI=arm64-v8a ./scripts/build_android.sh -DANDROID_CCACHE=$(which ccache) -DBUILD_BINARY=ON
```
Then run the benchmark binary over adb shell. Note that the Android CPU is not frequency locked by default, which can lead to noisy benchmark results; this can be addressed by running the following for every CPU.
```
adb shell "echo userspace > /sys/devices/system/cpu/${cpu}/cpufreq/scaling_governor"
adb shell "echo '2000000' > /sys/devices/system/cpu/${cpu}/cpufreq/scaling_setspeed"
adb push build_android/bin/quantize_per_channel /data/local/tmp/
adb shell "/data/local/tmp/quantize_per_channel"
```
Reviewed By: kimishpatel
Differential Revision: D24286488
fbshipit-source-id: 1e7942f0bb3d9d1fe172409d522be9f351a485bd