Add optimized quantize function for ARM (#26867)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26867
Use caffe2::Int8Quantize for PyTorch mobile. Currently this is implemented only for uint8 tensors and runs using NEON intrinsics.
All other cases fall back to the naive PyTorch quantize_val implementation.
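For reference, the scalar fallback applies the standard affine quantization formula per element. A minimal sketch is below; the real quantize_val in ATen is templated over the quantized dtype and derives the clamp bounds from it, so the fixed 0..255 range here assumes uint8:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Scalar affine quantization of a single float to uint8:
//   q = clamp(zero_point + round(value / scale), 0, 255)
// Illustrative sketch, not the exact ATen signature.
uint8_t quantize_val_uint8(float scale, int32_t zero_point, float value) {
  int32_t q = zero_point + static_cast<int32_t>(std::nearbyint(value / scale));
  q = std::min<int32_t>(std::max<int32_t>(q, 0), 255);
  return static_cast<uint8_t>(q);
}
```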
Previously, this naive quantize_val implementation was slow on mobile, accounting for more than 50% of the total execution time.
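The fast path vectorizes the same math with NEON, processing 16 elements per iteration. A rough AArch64 sketch of the idea follows; the function name and loop structure are illustrative, not the actual caffe2::Int8Quantize code, and it multiplies by a precomputed reciprocal of scale rather than dividing:

```cpp
#include <arm_neon.h>
#include <algorithm>
#include <cmath>
#include <cstdint>

// Sketch: affine-quantize n floats to uint8, 16 lanes per iteration.
void quantize_uint8_neon(const float* src, uint8_t* dst, int64_t n,
                         float inv_scale, int32_t zero_point) {
  const float32x4_t vinv_scale = vdupq_n_f32(inv_scale);
  const int32x4_t vzp = vdupq_n_s32(zero_point);
  int64_t i = 0;
  for (; i + 16 <= n; i += 16) {
    // Scale, round to nearest, and add the zero point, 4 lanes at a time.
    int32x4_t q0 = vaddq_s32(
        vcvtnq_s32_f32(vmulq_f32(vld1q_f32(src + i + 0), vinv_scale)), vzp);
    int32x4_t q1 = vaddq_s32(
        vcvtnq_s32_f32(vmulq_f32(vld1q_f32(src + i + 4), vinv_scale)), vzp);
    int32x4_t q2 = vaddq_s32(
        vcvtnq_s32_f32(vmulq_f32(vld1q_f32(src + i + 8), vinv_scale)), vzp);
    int32x4_t q3 = vaddq_s32(
        vcvtnq_s32_f32(vmulq_f32(vld1q_f32(src + i + 12), vinv_scale)), vzp);
    // Saturating narrow s32 -> s16 -> u8; this also clamps to [0, 255].
    int16x8_t p01 = vcombine_s16(vqmovn_s32(q0), vqmovn_s32(q1));
    int16x8_t p23 = vcombine_s16(vqmovn_s32(q2), vqmovn_s32(q3));
    vst1q_u8(dst + i, vcombine_u8(vqmovun_s16(p01), vqmovun_s16(p23)));
  }
  // Scalar tail: same math as the naive quantize_val.
  for (; i < n; ++i) {
    int32_t q = zero_point +
        static_cast<int32_t>(std::nearbyint(src[i] * inv_scale));
    dst[i] = static_cast<uint8_t>(std::min<int32_t>(std::max<int32_t>(q, 0), 255));
  }
}
```

Note that the saturating narrows (vqmovn/vqmovun) give the clamp to [0, 255] for free, which is part of why the vector path is so much cheaper than a per-element function call.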
Results (on mobile):
                           Before      After
aten::quantize_per_tensor  42.893 ms   0.340 ms
Total model runtime        70.5 ms     27.5 ms
Test Plan:
Verified the existing Python tests pass: python test/test_quantized.py TestQNNPackOps
Also tested a quantized MobileNetV2 on mobile and compared the outputs.
Imported from OSS
Differential Revision: D17638732
fbshipit-source-id: 76445d1e415e6e502d05ba5b900e5e1d875fc1b0