mem-efficient learnable fake quantization (#49315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49315
Update the learnable fake quantization to use C++ and CUDA kernels, and resolve some issues with using it under PyTorch DDP.
The updated quantization operators compute a different gradient for scale and zero_point when the output lies at the endpoints of the clamp operation: the gradient now follows the gradient of the `clamp` function. This behavior is consistent with the gradient calculation in the non-learnable fake-quant ops.
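To make the gradient behavior concrete, here is a minimal NumPy sketch of a learnable fake-quant forward pass and its straight-through-style backward pass. Inside the clamp range the input gradient passes through; at the saturated endpoints the scale and zero_point gradients follow the clamp branch, as described above. This is an illustrative sketch (function names and shapes are hypothetical), not the actual C++/CUDA kernel.

```python
import numpy as np

def fake_quant_forward(x, scale, zero_point, qmin, qmax):
    """Fake-quantize x: quantize, clamp to [qmin, qmax], dequantize."""
    q = np.round(x / scale + zero_point)
    q_clamped = np.clip(q, qmin, qmax)
    return (q_clamped - zero_point) * scale

def fake_quant_backward(x, scale, zero_point, qmin, qmax, grad_out):
    """Sketch of the gradients w.r.t. x, scale, and zero_point.

    Uses the straight-through estimator for round(); at the clamp
    endpoints the scale/zero_point gradients come from the clamp
    branch (out = (qmin - z) * s or (qmax - z) * s).
    """
    q = np.round(x / scale + zero_point)
    below, above = q < qmin, q > qmax
    inside = ~(below | above)
    # d out / d x: 1 inside the clamp range, 0 where saturated
    grad_x = grad_out * inside
    # d out / d scale: rounding error inside, clamp constant outside
    grad_scale = np.where(
        inside, q - zero_point - x / scale,
        np.where(below, qmin - zero_point, qmax - zero_point)) * grad_out
    # d out / d zero_point: 0 inside, -scale where saturated
    grad_zp = np.where(inside, 0.0, -scale) * grad_out
    return grad_x, grad_scale.sum(), grad_zp.sum()
```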
ghstack-source-id: 120821868
Test Plan:
# learnable_fake_quantization forward/backward op test
## Unit Test:
`buck test mode/dev-nosan -c fbcode.platform=platform009 //caffe2/test:quantization -- -v TestFakeQuantize`
## Benchmark Test:
`buck run mode/opt //caffe2/benchmarks/operator_benchmark/pt:quantization_test -- --operators FakeQuantizePerTensorOpBenchmark`
`buck run mode/opt //caffe2/benchmarks/operator_benchmark/pt:quantization_test -- --operators FakeQuantizePerChannelOpBenchmark`
### All times in **microseconds** (`1e-6` s)
References: P171624031
input size: [1, 3, 256, 256]
|                           | C++ Kernel | Non-backprop C++ Kernel |
|---------------------------|------------|-------------------------|
| Per Tensor CPU Forward    | 1372.123   | 1365.981                |
| Per Tensor Cuda Forward   | 84.586     | 27.205                  |
| Per Channel CPU Forward   | 2306.668   | 2299.991                |
| Per Channel Cuda Forward  | 154.742    | 135.219                 |
| Per Tensor CPU Backward   | 2544.617   | 581.268                 |
| Per Tensor Cuda Backward  | 304.529    | 137.335                 |
| Per Channel CPU Backward  | 3328.188   | 582.088                 |
| Per Channel Cuda Backward | 504.176    | 134.082                 |
input size: [1, 3, 512, 512]
|                           | C++ Kernel | Non-backprop C++ Kernel |
|---------------------------|------------|-------------------------|
| Per Tensor CPU Forward    | 5426.244   | 5726.440                |
| Per Tensor Cuda Forward   | 85.834     | 26.871                  |
| Per Channel CPU Forward   | 9125.913   | 9118.152                |
| Per Channel Cuda Forward  | 159.599    | 145.117                 |
| Per Tensor CPU Backward   | 14020.830  | 2214.864                |
| Per Tensor Cuda Backward  | 285.525    | 131.302                 |
| Per Channel CPU Backward  | 16977.141  | 2104.345                |
| Per Channel Cuda Backward | 541.511    | 120.222                 |
# Use learnable_fake_quantization in AI-denoising QAT:
f229412681
Reviewed By: raghuramank100
Differential Revision: D24479735
fbshipit-source-id: 5275596f3ce8200525f4d9d07d0c913afdf8b43a