Optimization of Backward Implementation for Learnable Fake Quantize Per Tensor Kernels (CPU and GPU) (#42384)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42384
In this diff, the original backward-pass implementation is sped up by fusing the three separate iterations that computed dX, dScale, and dZeroPoint into a single loop. The fused loop iterates over the tensors directly at the byte level (addressing elements via `strides`).
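To illustrate what the fused loop computes, here is a NumPy sketch of the backward math, assuming the standard straight-through-estimator gradients for a learnable fake-quantize op (`fused_learnable_fake_quant_backward` is a hypothetical reference function, not the actual C++ kernel; the real kernel walks the buffers via `strides` rather than using vectorized NumPy ops):

```python
import numpy as np

def fused_learnable_fake_quant_backward(dY, X, scale, zero_point, qmin, qmax):
    """Reference sketch: compute dX, dScale, dZeroPoint in one pass.

    Gradients follow the usual straight-through formulation for
    Y = (clamp(round(X / scale) + zero_point, qmin, qmax) - zero_point) * scale.
    """
    Xq = np.round(X / scale) + zero_point          # unclamped quantized value
    in_range = (Xq >= qmin) & (Xq <= qmax)
    below = Xq < qmin                               # clipped at the low end

    # dX: gradient passes through only where no clipping occurred.
    dX = np.where(in_range, dY, 0.0)

    # dScale: rounding error in range, clipped boundary value otherwise.
    dScale_elem = np.where(
        in_range, (Xq - zero_point) - X / scale,
        np.where(below, qmin - zero_point, qmax - zero_point))

    # dZeroPoint: zero in range, -scale where the value was clipped.
    dZero_elem = np.where(in_range, 0.0, -scale)

    # Scale/zero_point are per-tensor scalars, so their grads reduce to a sum.
    return dX, (dY * dScale_elem).sum(), (dY * dZero_elem).sum()
```

In the original implementation each of these three gradients was produced by its own iteration over the input; computing them together amortizes the traversal cost, which is where most of the ~4x kernel speedup comes from.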
In the operator benchmark, for an input of shape `3x3x256x256`, we observed the following performance:
- original python operator: 1021037 microseconds
- original learnable kernel: 407576 microseconds
- optimized learnable kernel: 102584 microseconds
- original non-backprop kernel: 139806 microseconds
**Speedup from python operator**: ~10x
**Speedup from original learnable kernel**: ~4x
**Speedup from non-backprop kernel**: ~1.4x
Test Plan:
To assert correctness of the new kernel, on a devvm, enter the command
`buck test //caffe2/test:quantization -- learnable_backward_per_tensor`
To benchmark the operators, on a devvm:
1. Set the input shape to `3x3x256x256` or another reasonable size.
2. Run `buck test //caffe2/benchmarks/operator_benchmark/pt:quantization_test`
3. The relevant outputs are as follows:
(CPU)
```
# Benchmarking PyTorch: FakeQuantizePerTensorOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerTensorOpBenchmark_N3_C3_H256_W256_nbits4_cpu_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 1021036.957
# Benchmarking PyTorch: FakeQuantizePerTensorOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerTensorOpBenchmark_N3_C3_H256_W256_nbits4_cpu_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 102583.693
# Benchmarking PyTorch: FakeQuantizePerTensorOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerTensorOpBenchmark_N3_C3_H256_W256_nbits4_cpu_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 139806.086
```
(GPU)
```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cuda, op_type: py_module
Backward Execution Time (us) : 6548.350
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cuda, op_type: learnable_kernel
Backward Execution Time (us) : 1340.724
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cuda, op_type: original_kernel
Backward Execution Time (us) : 656.863
```
Reviewed By: vkuzo
Differential Revision: D22875998
fbshipit-source-id: cfcd62c327bb622270a783d2cbe97f00508c4a16