[PyTorch Edge] Parallelize Quantize and Dequantize Tensor (#65845)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65845
Benchmarks comparing the non-parallelized and parallelized quantize/dequantize paths across various devices and input sizes are in this notebook:
https://www.internalfb.com/intern/anp/view/?id=1204834&scroll_cell=17&checkpoint_id=432447238302644
For example:
- {F671713127}
- {F671713209}
- {F671713238}
- {F671713253}
When run on the Partially Quantized Mobile Vision Transformer Model (as described in D31066997):
Before this diff (on D31444248 v7):
- [120.907ms](https://our.intern.facebook.com/intern/aibench/details/945891590820680)
With this diff (v19):
- Threshold = 2^16: [118.086ms](https://our.intern.facebook.com/intern/aibench/details/436376817372377)
- Threshold = 2^20: [118.361ms](https://our.intern.facebook.com/intern/aibench/details/617543354077290)
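The thresholds above control when the parallel path kicks in: tensors below the element-count threshold stay on the serial loop, since thread launch overhead would outweigh the gain. A minimal sketch of that dispatch pattern is below; the names (`kParallelThreshold`, `quantize_range`, `quantize`) and the use of `std::thread` are illustrative assumptions, not the actual implementation, which uses PyTorch's internal parallelization primitives.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical element-count threshold (the diff benchmarks 2^16 and 2^20);
// below it the serial path is used.
constexpr std::size_t kParallelThreshold = 1u << 16;

// Affine quantization of a contiguous range:
//   q = clamp(round(x / scale) + zero_point, 0, 255)
void quantize_range(const float* src, std::uint8_t* dst,
                    std::size_t begin, std::size_t end,
                    float scale, std::int32_t zero_point) {
  for (std::size_t i = begin; i < end; ++i) {
    std::int32_t q =
        static_cast<std::int32_t>(std::nearbyint(src[i] / scale)) + zero_point;
    dst[i] = static_cast<std::uint8_t>(std::min(255, std::max(0, q)));
  }
}

// Thresholded dispatch: serial for small tensors, chunked across threads
// otherwise (PyTorch itself would use its parallel-for primitive here).
void quantize(const float* src, std::uint8_t* dst, std::size_t numel,
              float scale, std::int32_t zero_point) {
  if (numel < kParallelThreshold) {
    quantize_range(src, dst, 0, numel, scale, zero_point);
    return;
  }
  const std::size_t n_threads =
      std::max<std::size_t>(1, std::thread::hardware_concurrency());
  const std::size_t chunk = (numel + n_threads - 1) / n_threads;
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < n_threads; ++t) {
    const std::size_t begin = t * chunk;
    const std::size_t end = std::min(numel, begin + chunk);
    if (begin >= end) break;
    workers.emplace_back(quantize_range, src, dst, begin, end, scale,
                         zero_point);
  }
  for (auto& w : workers) w.join();
}
```

Dequantization follows the same shape with the inverse map `x = scale * (q - zero_point)` in the per-range loop.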
ghstack-source-id: 142166374
Test Plan:
Same as previous diff (D31066997)
All tests pass
Also, with numel set to 2^21 in the quantized_test case TestArmVectorizedAndParallelQuantizeDequantize (https://www.internalfb.com/diff/D31066997?dst_version_fbid=596325738080019&transaction_fbid=219437170135898), the tests passed.
Reviewed By: kimishpatel
Differential Revision: D31205883
fbshipit-source-id: 9ed0b11a376734feaf228074a24b8eb79d5270a3