pytorch
68a9057a - [PyTorch Edge] Add Optimized QInt8 Quantize Tensor Arm (#76245)

2 years ago
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76245

The implementation is very similar to that of the existing QUInt8 version.

Test Plan:
From a clone of open-source PyTorch:
- `BUILD_MOBILE_BENCHMARK=1 BUILD_MOBILE_TEST=1 ANDROID_DEBUG_SYMBOLS=1 BUILD_LITE_INTERPRETER=0 ANDROID_ABI=arm64-v8a ./scripts/build_android.sh -DANDROID_CCACHE=$(which ccache) -DBUILD_BINARY=ON`

Send the binary to an Android device and run it:
- Test with `build_android/bin/quantized_test`
- Benchmark with `build_android/bin/quantize_per_channel` (after changes in D35616898)

___

Benchmark Results:

Benchmark on Body Keypoint Model (as in D35616898):
- Before: [21.0584 ms](https://www.internalfb.com/intern/aibench/details/14343432029716)
- After: [11.8182 ms](https://www.internalfb.com/intern/aibench/details/697250961900934)

Benchmark in isolation over a variety of input shapes:
- Before: P495061553
- After: P495058591

Graphs generated by: https://www.internalfb.com/intern/anp/view/?id=1798160&revision_id=1018261229074723

Average speedup over all tested values of C and N: 3.27x

(Speedup graphs, internal attachments: F722742346, F722742351, F722742345, F722742353, F722742352, F722742350, F722742347)

___

Test Results:

```
quantized_test: 1 file pushed. 11.8 MB/s (1261058776 bytes in 102.295s)
Running main() from ../third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 10 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 10 tests from TestQTensor
[ RUN      ] TestQTensor.QuantDequantAPIs
[       OK ] TestQTensor.QuantDequantAPIs (2 ms)
[ RUN      ] TestQTensor.RoundingMode
[       OK ] TestQTensor.RoundingMode (0 ms)
[ RUN      ] TestQTensor.Item
[       OK ] TestQTensor.Item (0 ms)
[ RUN      ] TestQTensor.EmptyQuantized
[       OK ] TestQTensor.EmptyQuantized (0 ms)
[ RUN      ] TestQTensor.EmptyPerchannelQuantized
[       OK ] TestQTensor.EmptyPerchannelQuantized (0 ms)
[ RUN      ] TestQTensor.QuantizePerChannel4d
[       OK ] TestQTensor.QuantizePerChannel4d (0 ms)
[ RUN      ] TestQTensor.QuantizePerChannel4dChannelsLast
[       OK ] TestQTensor.QuantizePerChannel4dChannelsLast (10 ms)
[ RUN      ] TestQTensor.FromBlobQuantizedPerTensor
[       OK ] TestQTensor.FromBlobQuantizedPerTensor (0 ms)
[ RUN      ] TestQTensor.FromBlobQuantizedPerChannel
[       OK ] TestQTensor.FromBlobQuantizedPerChannel (0 ms)
[ RUN      ] TestQTensor.TestArmVectorizedQuantizeDequantize
[       OK ] TestQTensor.TestArmVectorizedQuantizeDequantize (0 ms)
[----------] 10 tests from TestQTensor (15 ms total)

[----------] Global test environment tear-down
[==========] 10 tests from 1 test suite ran. (15 ms total)
[  PASSED  ] 10 tests.
```

Reviewed By: kimishpatel

Differential Revision: D35283670

fbshipit-source-id: b8fd72186c53956de808ea0426c0aa0abc3eb348
(cherry picked from commit 0496af31e9664f85bca64a592aa66b9d3ed0d846)
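For context on what a qint8 quantize kernel computes, here is a minimal scalar sketch in C++ of the per-element arithmetic that a vectorized ARM kernel would process several lanes at a time. It assumes PyTorch-style affine quantization, q = clamp(round(x * (1 / scale)) + zero_point, -128, 127), with round-half-to-even and finite inputs. The helpers `quantize_qint8` and `quantize_per_channel_nchw` are hypothetical illustrations, not the NEON code added by this commit, which may differ in details such as the rounding instructions it uses. In this scalar form, the main difference from a quint8 version is the clamp range ([-128, 127] instead of [0, 255]).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical scalar reference for quantizing one float to qint8.
// Assumes q = clamp(round(x * (1 / scale)) + zero_point, -128, 127), where
// std::nearbyint rounds half to even under the default rounding mode.
// Inputs are assumed finite.
int8_t quantize_qint8(float x, float scale, int32_t zero_point) {
  constexpr int64_t kQMin = -128;  // qint8 is signed, unlike quint8's [0, 255]
  constexpr int64_t kQMax = 127;
  const float inv_scale = 1.0f / scale;
  const int64_t q =
      static_cast<int64_t>(std::nearbyint(x * inv_scale)) + zero_point;
  return static_cast<int8_t>(std::clamp(q, kQMin, kQMax));
}

// Hypothetical per-channel variant for a contiguous NCHW tensor stored as a
// flat vector: every element in channel c uses scales[c] and zero_points[c].
std::vector<int8_t> quantize_per_channel_nchw(
    const std::vector<float>& x, int64_t N, int64_t C, int64_t HW,
    const std::vector<float>& scales, const std::vector<int32_t>& zero_points) {
  assert(static_cast<int64_t>(x.size()) == N * C * HW);
  assert(static_cast<int64_t>(scales.size()) == C);
  assert(static_cast<int64_t>(zero_points.size()) == C);
  std::vector<int8_t> out(x.size());
  for (int64_t n = 0; n < N; ++n) {
    for (int64_t c = 0; c < C; ++c) {
      // Each (n, c) plane is HW contiguous floats sharing one scale and zero
      // point: the kind of run a SIMD kernel could process in vector chunks.
      const int64_t base = (n * C + c) * HW;
      for (int64_t i = 0; i < HW; ++i) {
        out[base + i] = quantize_qint8(x[base + i], scales[c], zero_points[c]);
      }
    }
  }
  return out;
}
```

For example, with scale 1.0 and zero point 0, an input of 2.5 rounds to 2 (half to even) and 1000.0 saturates to 127. With two channels using scales {1.0, 0.5} and zero points {0, 10}, inputs {1, 2, 3, 4} map to {1, 2, 16, 18}.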