0c60922f - mem-efficient learnable fake quantization (#49315)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49315

Update the learnable fake quantization to use C++ and CUDA kernels, and resolve some issues with using it under PyTorch DDP.

The updated quantization operator has a different gradient calculation for scale and zero_point when the output lands at the endpoints of the clamp operation: the gradient is computed according to the gradient of the `clamp` function. This behavior is consistent with the gradient calculation of the non-learnable fake-quant ops.

ghstack-source-id: 120821868

Test Plan:
# learnable_fake_quantization forward/backward op test

## Unit Test:
`buck test mode/dev-nosan -c fbcode.platform=platform009 //caffe2/test:quantization -- -v TestFakeQuantize`

## Benchmark Test:
`buck run mode/opt //caffe2/benchmarks/operator_benchmark/pt:quantization_test -- --operators FakeQuantizePerTensorOpBenchmark`
`buck run mode/opt //caffe2/benchmarks/operator_benchmark/pt:quantization_test -- --operators FakeQuantizePerChannelOpBenchmark`

### Times in **microseconds** (`1e-6` second), References: P171624031

input size: [1, 3, 256, 256]

|                           | C++ Kernel | Non-backprop C++ Kernel |
|---------------------------|------------|-------------------------|
| Per Tensor CPU Forward    | 1372.123   | 1365.981                |
| Per Tensor CUDA Forward   | 84.586     | 27.205                  |
| Per Channel CPU Forward   | 2306.668   | 2299.991                |
| Per Channel CUDA Forward  | 154.742    | 135.219                 |
| Per Tensor CPU Backward   | 2544.617   | 581.268                 |
| Per Tensor CUDA Backward  | 304.529    | 137.335                 |
| Per Channel CPU Backward  | 3328.188   | 582.088                 |
| Per Channel CUDA Backward | 504.176    | 134.082                 |

input size: [1, 3, 512, 512]

|                           | C++ Kernel | Non-backprop C++ Kernel |
|---------------------------|------------|-------------------------|
| Per Tensor CPU Forward    | 5426.244   | 5726.440                |
| Per Tensor CUDA Forward   | 85.834     | 26.871                  |
| Per Channel CPU Forward   | 9125.913   | 9118.152                |
| Per Channel CUDA Forward  | 159.599    | 145.117                 |
| Per Tensor CPU Backward   | 14020.830  | 2214.864                |
| Per Tensor CUDA Backward  | 285.525    | 131.302                 |
| Per Channel CPU Backward  | 16977.141  | 2104.345                |
| Per Channel CUDA Backward | 541.511    | 120.222                 |

# use learnable_fake_quantization in AI-denoising QAT: f229412681

Reviewed By: raghuramank100

Differential Revision: D24479735

fbshipit-source-id: 5275596f3ce8200525f4d9d07d0c913afdf8b43a
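For reference, the clamp-based gradient rule described in the summary can be written as a plain-PyTorch autograd sketch. This is not the fused C++/CUDA kernel added by this PR; it is a minimal per-tensor illustration (the class name `LearnableFakeQuantizeSketch` is hypothetical) of how the scale and zero_point gradients fall back to the gradient of `clamp` when a value is clamped to `quant_min` or `quant_max`.

```python
import torch


class LearnableFakeQuantizeSketch(torch.autograd.Function):
    """Per-tensor learnable fake quantization with clamp-consistent gradients."""

    @staticmethod
    def forward(ctx, x, scale, zero_point, quant_min, quant_max):
        zp = torch.round(zero_point)
        q = torch.round(x / scale) + zp              # unclamped integer value
        q_clamped = torch.clamp(q, quant_min, quant_max)
        ctx.save_for_backward(x, scale, zp, q, q_clamped)
        ctx.quant_min, ctx.quant_max = quant_min, quant_max
        return (q_clamped - zp) * scale              # dequantized output

    @staticmethod
    def backward(ctx, grad_out):
        x, scale, zp, q, q_clamped = ctx.saved_tensors
        lower = q < ctx.quant_min
        upper = q > ctx.quant_max
        inside = ~(lower | upper)

        # d out / d x: straight-through inside the clamp range, zero outside,
        # matching the gradient of the clamp and the non-learnable fake-quant ops.
        grad_x = grad_out * inside

        # d out / d scale:
        #   inside the range: round(x/s) - x/s   (straight-through estimator on round)
        #   clamped low:      quant_min - zero_point
        #   clamped high:     quant_max - zero_point
        grad_scale_elem = torch.where(
            inside,
            (q_clamped - zp) - x / scale,
            torch.where(lower, ctx.quant_min - zp, ctx.quant_max - zp),
        )
        grad_scale = (grad_out * grad_scale_elem).sum().reshape(scale.shape)

        # d out / d zero_point: zero inside the range (the zero_point cancels),
        # -scale at the clamp endpoints.
        grad_zp_elem = torch.where(inside, torch.zeros_like(x), -scale)
        grad_zp = (grad_out * grad_zp_elem).sum().reshape(zp.shape)

        return grad_x, grad_scale, grad_zp, None, None


if __name__ == "__main__":
    # Toy usage: negative inputs are clamped to quant_min and therefore route
    # gradient into scale and zero_point through the clamp branch.
    x = torch.randn(2, 3, requires_grad=True)
    scale = torch.tensor([0.1], requires_grad=True)
    zero_point = torch.tensor([0.0], requires_grad=True)
    out = LearnableFakeQuantizeSketch.apply(x, scale, zero_point, 0, 255)
    out.sum().backward()                             # fills x.grad, scale.grad, zero_point.grad
```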