Speed up HistogramObserver by vectorizing critical path (#41041)
Summary:
22x speedup over the code this replaces. Tested on ResNet18 on a devvm using CPU only, using default parameters for HistogramObserver (i.e. 2048 bins).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41041
Test Plan:
To run the test against the reference (old) implementation, you can use `python test/test_quantization.py TestRecordHistogramObserver.test_histogram_observer_against_reference`.
To run the benchmark, while in the folder `benchmarks/operator_benchmark`, you can use `python -m benchmark_all_quantized_test --operators HistogramObserverCalculateQparams`.
Benchmark results before speedup:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_affine
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_affine
Forward Execution Time (us) : 185818.566
# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_symmetric
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_symmetric
Forward Execution Time (us) : 165325.916
```
Benchmark results after speedup:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_affine
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_affine
Forward Execution Time (us) : 12242.241
# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_symmetric
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_symmetric
Forward Execution Time (us) : 12655.354
```
Reviewed By: raghuramank100
Differential Revision: D22400755
Pulled By: durumu
fbshipit-source-id: 639ac796a554710a33c8a930c1feae95a1148718