quantize_tensor_per_channel ARM implementation (#46018)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46018
Currently on mobile devices quantize_tensor has a vectorized implementation using ARM intrinsics; however, quantize_tensor_per_channel does not. This change adds a matching ARM-vectorized implementation for quantize_tensor_per_channel.
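For reference, a scalar sketch of what the per-channel kernel computes (and what the NEON version vectorizes): each channel has its own scale and zero point, and every element is mapped by `q = clamp(round(x / scale) + zero_point, 0, 255)`. The function name, layout assumption (contiguous `[channels, elements_per_channel]`), and signature below are illustrative, not PyTorch's internal API.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Scalar reference for per-channel affine quantization to quint8.
// Assumes src is laid out as [channels, elements_per_channel] contiguously.
std::vector<uint8_t> quantize_per_channel(const std::vector<float>& src,
                                          const std::vector<float>& scales,
                                          const std::vector<int32_t>& zero_points,
                                          size_t elements_per_channel) {
  std::vector<uint8_t> dst(src.size());
  for (size_t c = 0; c < scales.size(); ++c) {
    const float inv_scale = 1.0f / scales[c];
    for (size_t i = 0; i < elements_per_channel; ++i) {
      const size_t idx = c * elements_per_channel + i;
      // q = clamp(round(x / scale) + zero_point, 0, 255)
      const int32_t q =
          static_cast<int32_t>(std::nearbyint(src[idx] * inv_scale)) + zero_points[c];
      dst[idx] = static_cast<uint8_t>(std::min(255, std::max(0, q)));
    }
  }
  return dst;
}
```

The vectorized kernel performs the same multiply/round/add/clamp sequence on NEON lanes, reloading the scale and zero point at each channel boundary.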
Test Plan:
Build for ARM NEON
```
BUILD_MOBILE_BENCHMARK=1 BUILD_MOBILE_TEST=1 ANDROID_DEBUG_SYMBOLS=1 BUILD_PYTORCH_MOBILE=1 ANDROID_ABI="armeabi-v7a with NEON" ./scripts/build_android.sh -DANDROID_CCACHE=$(which ccache) -DBUILD_BINARY=ON
```
Build for ARM64
```
BUILD_MOBILE_BENCHMARK=1 BUILD_MOBILE_TEST=1 ANDROID_DEBUG_SYMBOLS=1 BUILD_PYTORCH_MOBILE=1 ANDROID_ABI=arm64-v8a ./scripts/build_android.sh -DANDROID_CCACHE=$(which ccache) -DBUILD_BINARY=ON
```
Then run the benchmark binary over adb shell. Note that the Android CPU is not frequency-locked by default, which can lead to noisy benchmark results; this can be changed by running the following for every CPU.
```
adb shell "echo userspace > /sys/devices/system/cpu/${cpu}/cpufreq/scaling_governor"
adb shell "echo '2000000' > /sys/devices/system/cpu/${cpu}/cpufreq/scaling_setspeed"
adb push build_android/bin/quantize_per_channel /data/local/tmp/
adb shell "/data/local/tmp/quantize_per_channel"
```
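Since the governor commands above must be repeated for each CPU, a small helper loop can apply them to a list of cores. This is a sketch: `lock_cpus` is a hypothetical wrapper (not part of the repo), the 2000000 kHz setpoint is the one used above, and the CPU list depends on the SoC.

```shell
# lock_cpus ADB_BIN CPU... : set each listed cpu to a fixed 2 GHz userspace
# governor via the given adb binary (hypothetical helper, not in the repo).
lock_cpus() {
  adb_bin="$1"
  shift
  for cpu in "$@"; do
    "$adb_bin" shell "echo userspace > /sys/devices/system/cpu/${cpu}/cpufreq/scaling_governor"
    "$adb_bin" shell "echo '2000000' > /sys/devices/system/cpu/${cpu}/cpufreq/scaling_setspeed"
  done
}

# On a real device (core list varies per SoC):
# lock_cpus adb cpu0 cpu1 cpu2 cpu3
```

Writing to these sysfs nodes generally requires a rooted device or an engineering build.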
Resulting benchmarks are located [here](https://gist.github.com/AJLiu/d1711bb6a5e93b3338eca2c14c8aec9f)
Google spreadsheet comparing results [here](https://docs.google.com/spreadsheets/d/1Ky-rEu2CqOqex2a84b67hB1VLAlfEDgAN2ZXe8IlGF8/edit?usp=sharing)
Reviewed By: kimishpatel
Differential Revision: D24286528
fbshipit-source-id: 5481dcbbff8345a2c0d6cc9b7d7f8075fbff03b3