pytorch
7332c21f - Speed up HistogramObserver by vectorizing critical path (#41041)

Commit
4 years ago
Speed up HistogramObserver by vectorizing critical path (#41041)

Summary:
22x speedup over the code this replaces. Tested on ResNet18 on a devvm using CPU only, using default parameters for HistogramObserver (i.e. 2048 bins).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41041

Test Plan:
To run the test against the reference (old) implementation, you can use `python test/test_quantization.py TestRecordHistogramObserver.test_histogram_observer_against_reference`.

To run the benchmark, while in the folder `benchmarks/operator_benchmark`, you can use `python -m benchmark_all_quantized_test --operators HistogramObserverCalculateQparams`.

Benchmark results before speedup:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_affine
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_affine
Forward Execution Time (us) : 185818.566

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_symmetric
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_symmetric
Forward Execution Time (us) : 165325.916
```

Benchmark results after speedup:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_affine
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_affine
Forward Execution Time (us) : 12242.241

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_symmetric
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_symmetric
Forward Execution Time (us) : 12655.354
```

Reviewed By: raghuramank100

Differential Revision: D22400755

Pulled By: durumu

fbshipit-source-id: 639ac796a554710a33c8a930c1feae95a1148718
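
For reference, the per-qscheme speedup implied by the micro-benchmark numbers above can be worked out with a few lines of Python. This is only an arithmetic sketch over the reported forward execution times; the dictionary and function names are illustrative and not part of the PyTorch benchmark suite.

```python
# Forward execution times in microseconds, copied from the benchmark
# output in this commit message (C=3, M=512, N=512, torch.quint8, CPU).
before_us = {
    "per_tensor_affine": 185818.566,
    "per_tensor_symmetric": 165325.916,
}
after_us = {
    "per_tensor_affine": 12242.241,
    "per_tensor_symmetric": 12655.354,
}


def speedup(before: float, after: float) -> float:
    """Return how many times faster the new timing is than the old one."""
    return before / after


# Hypothetical helper structure: one speedup ratio per qscheme.
speedups = {q: speedup(before_us[q], after_us[q]) for q in before_us}

for qscheme, ratio in speedups.items():
    print(f"{qscheme}: {ratio:.1f}x")
# prints:
# per_tensor_affine: 15.2x
# per_tensor_symmetric: 13.1x
```

On this micro-benchmark the ratios come out at roughly 15x and 13x. The 22x figure in the summary was measured on ResNet18, a different workload, so the two numbers are not expected to match exactly.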