Operator-level Benchmark Test for Per Tensor and Per Channel Fake Quantization (#41974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41974
This diff adds two new sets of benchmark tests to the `quantization` benchmark suite, providing operator-level benchmarks for the learnable Python operators, the learnable C++ kernels, and the original non-backprop C++ kernels.
Test Plan:
From the directory `torch/benchmarks/operator_benchmark` (the root directory is `caffe2` inside `fbcode` when working on a devvm):
- On a devvm, run the command `buck run pt:fake_quantize_learnable_test`
- On a personal laptop, run the command `python3 -m pt.fake_quantize_learnable_test`
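Outside the benchmark harness, a rough wall-clock check of a single op can be sketched with `timeit`. This is a minimal sketch with a hypothetical pure-Python stand-in for the fake-quantize call; the actual suite (`pt/fake_quantize_learnable_test`) handles warmup, iteration counts, and CUDA synchronization, none of which this sketch attempts:

```python
import timeit

# Hypothetical stand-in for the fake-quantize op under test; it applies the
# quantize-and-clamp step to a flat list of floats.
def fake_quant_stub(values, scale=0.1, zero_point=128, qmin=0, qmax=255):
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]

# Same element count as one 3x256x256 benchmark sample.
sample = [0.001 * i for i in range(3 * 256 * 256)]

# Average wall-clock time over 10 calls, reported in microseconds to match
# the units used in the result tables.
avg_s = timeit.timeit(lambda: fake_quant_stub(sample), number=10) / 10
print(f"avg per call: {avg_s * 1e6:.1f} us")
```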
Benchmark Results (on a devGPU with 0% volatile GPU utilization, i.e. all GPUs idle):
Each sample has dimensions **3x256x256**.
### Results in **microseconds** (`1e-6` second)
| | Python Module | C++ Kernel | Non-backprop C++ Kernel |
|---------------------------|---------------|------------|-------------------------|
| Per Tensor CPU Forward | 3112.666 | 3270.740 | 3596.864 |
| Per Tensor CUDA Forward | 797.258 | 258.961 | 133.953 |
| Per Channel CPU Forward | 6587.693 | 6931.461 | 6352.417 |
| Per Channel CUDA Forward | 1579.576 | 555.723 | 479.016 |
| Per Tensor CPU Backward | 72278.390 | 22466.648 | 12922.195 |
| Per Tensor CUDA Backward | 6512.280 | 1546.218 | 652.942 |
| Per Channel CPU Backward | 74138.545 | 41212.777 | 14131.576 |
| Per Channel CUDA Backward | 6795.173 | 4321.351 | 1052.066 |
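For reference, the forward math shared by the benchmarked kernels can be sketched in pure Python. This is an illustrative reimplementation of standard affine fake quantization (quantize, clamp, dequantize), not the actual vectorized C++/CUDA code; the learnable variants additionally backpropagate into `scale` and `zero_point`:

```python
def fake_quantize_per_tensor(values, scale, zero_point, quant_min=0, quant_max=255):
    """Quantize-then-dequantize each value, simulating quantization error.

    Illustrative per-tensor sketch only; the real kernels operate on whole
    tensors and, for the learnable variants, also compute gradients with
    respect to scale and zero_point.
    """
    out = []
    for x in values:
        q = round(x / scale) + zero_point        # quantize to the integer grid
        q = max(quant_min, min(quant_max, q))    # clamp to the representable range
        out.append((q - zero_point) * scale)     # dequantize back to float
    return out

# A per-channel variant applies a distinct (scale, zero_point) pair per channel.
```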
Reviewed By: z-a-f
Differential Revision: D22715683
fbshipit-source-id: 8be528b790663413cbeeabd4f68bbca00be052dd