[PyTorch Edge] Add Optimized QInt8 Quantize Tensor for Arm (#76245)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76245
The implementation closely follows that of the existing QUInt8 version.
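For context, the relationship between the QInt8 and QUInt8 paths can be illustrated with a scalar reference. This is a minimal sketch, not the code in this PR: the names `quantize_val_ref` and `quantize_vec_ref` are hypothetical, it uses no NEON intrinsics, and it assumes standard affine quantization with round-half-to-even. Under those assumptions the two variants differ only in the storage type and its clamping range.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical scalar reference (not the PR's vectorized kernel).
// T is the storage type: std::int8_t for QInt8, std::uint8_t for QUInt8.
// Computes q = clamp(round(value / scale) + zero_point, qmin, qmax).
// NaN inputs are not handled in this sketch.
template <typename T>
T quantize_val_ref(float scale, std::int32_t zero_point, float value) {
  constexpr std::int64_t qmin = std::numeric_limits<T>::min();
  constexpr std::int64_t qmax = std::numeric_limits<T>::max();
  // Clamp in float before converting to an integer, so very large inputs
  // cannot overflow the integer conversion.
  const float lo = static_cast<float>(qmin - zero_point);
  const float hi = static_cast<float>(qmax - zero_point);
  // std::nearbyint rounds half to even under the default rounding mode.
  const float rounded = std::nearbyint(value / scale);
  const float clamped = std::min(std::max(rounded, lo), hi);
  return static_cast<T>(static_cast<std::int64_t>(clamped) + zero_point);
}

// Element-wise per-tensor quantization of a contiguous float buffer.
template <typename T>
void quantize_vec_ref(const float* in, T* out, std::size_t n,
                      float scale, std::int32_t zero_point) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = quantize_val_ref<T>(scale, zero_point, in[i]);
  }
}
```

Instantiating the template with `std::int8_t` saturates to [-128, 127], while `std::uint8_t` saturates to [0, 255]. A vectorized kernel would typically process several floats per iteration and fall back to a scalar loop like this one for the remaining tail elements.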
Test Plan:
From a clone of open-source PyTorch:
- BUILD_MOBILE_BENCHMARK=1 BUILD_MOBILE_TEST=1 ANDROID_DEBUG_SYMBOLS=1 BUILD_LITE_INTERPRETER=0 ANDROID_ABI=arm64-v8a ./scripts/build_android.sh -DANDROID_CCACHE=$(which ccache) -DBUILD_BINARY=ON
Send the binary to an Android device and run it:
- Test with ```build_android/bin/quantized_test```
- Benchmark with ```build_android/bin/quantize_per_channel``` (after changes in D35616898)
___
Benchmark Results:
Benchmark on the Body Keypoint Model (as in D35616898):
- Before: [21.0584 ms](https://www.internalfb.com/intern/aibench/details/14343432029716)
- After: [11.8182 ms](https://www.internalfb.com/intern/aibench/details/697250961900934)
Benchmark in isolation over a variety of input shapes:
- Before: P495061553
- After: P495058591
Graphs generated by: https://www.internalfb.com/intern/anp/view/?id=1798160&revision_id=1018261229074723
Average speedup over all C and N: 3.27x
{F722742346}
{F722742351}
{F722742345}
{F722742353}
{F722742352}
{F722742350}
{F722742347}
___
Test Results:
```
quantized_test: 1 file pushed. 11.8 MB/s (1261058776 bytes in 102.295s)
Running main() from ../third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 10 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 10 tests from TestQTensor
[ RUN ] TestQTensor.QuantDequantAPIs
[ OK ] TestQTensor.QuantDequantAPIs (2 ms)
[ RUN ] TestQTensor.RoundingMode
[ OK ] TestQTensor.RoundingMode (0 ms)
[ RUN ] TestQTensor.Item
[ OK ] TestQTensor.Item (0 ms)
[ RUN ] TestQTensor.EmptyQuantized
[ OK ] TestQTensor.EmptyQuantized (0 ms)
[ RUN ] TestQTensor.EmptyPerchannelQuantized
[ OK ] TestQTensor.EmptyPerchannelQuantized (0 ms)
[ RUN ] TestQTensor.QuantizePerChannel4d
[ OK ] TestQTensor.QuantizePerChannel4d (0 ms)
[ RUN ] TestQTensor.QuantizePerChannel4dChannelsLast
[ OK ] TestQTensor.QuantizePerChannel4dChannelsLast (10 ms)
[ RUN ] TestQTensor.FromBlobQuantizedPerTensor
[ OK ] TestQTensor.FromBlobQuantizedPerTensor (0 ms)
[ RUN ] TestQTensor.FromBlobQuantizedPerChannel
[ OK ] TestQTensor.FromBlobQuantizedPerChannel (0 ms)
[ RUN ] TestQTensor.TestArmVectorizedQuantizeDequantize
[ OK ] TestQTensor.TestArmVectorizedQuantizeDequantize (0 ms)
[----------] 10 tests from TestQTensor (15 ms total)
[----------] Global test environment tear-down
[==========] 10 tests from 1 test suite ran. (15 ms total)
[ PASSED ] 10 tests.
```
Reviewed By: kimishpatel
Differential Revision: D35283670
fbshipit-source-id: b8fd72186c53956de808ea0426c0aa0abc3eb348
(cherry picked from commit 0496af31e9664f85bca64a592aa66b9d3ed0d846)