pytorch
f8b084c5 - dbr quant overhead[1/x]: remove expensive calls to named_modules (#68309)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68309

This is the first of a series of PRs to reduce the overhead of the DBR quantization prototype. For now, the measurement of this work is not very scientific, as there is a lot of low-hanging fruit. As we speed up the prototype, we may need to invest in better benchmarking.

Current benchmarking setup:
* macOS laptop with OMP_NUM_THREADS=1
* torchvision's mobilenet_v2
* input size 1x3x224x224
* we measure the fp32 forward, prepared forward, and quantized forward, with FX quant vs DBR quant (see the benchmark sketch below)

Note that due to the small input size, this benchmark is fairly noisy. The goal here is to measure the overhead of the DBR quant logic (not the kernels), so a small input is good: we want the kernels to take as small a percentage of the overall time as possible. The high-level goal is for the DBR quant convert forward to approach the FX time.

This first PR removes the expensive named_modules calls and resets the op counter in the op instead. According to cProfile, this should be a 2 to 3 percent win.

Test Plan:
```
benchmark: https://gist.github.com/vkuzo/1a4f98ca541161704ee3c305d7740d4a

// before
fp32: 0.020101 seconds avg
fx_prepared: 0.020915 seconds avg, 0.961083 speedup vs fp32
fx_quantized: 0.012037 seconds avg, 1.670005 speedup vs fp32
dt_prepared: 0.037506 seconds avg, 0.535953 speedup vs fp32
dt_quantized: 0.022688 seconds avg, 0.885988 speedup vs fp32

// after
fp32: 0.020722 seconds avg
fx_prepared: 0.023417 seconds avg, 0.884893 speedup vs fp32
fx_quantized: 0.014834 seconds avg, 1.396942 speedup vs fp32
dt_prepared: 0.039120 seconds avg, 0.529700 speedup vs fp32
dt_quantized: 0.020063 seconds avg, 1.032831 speedup vs fp32
```

Reviewed By: albanD

Differential Revision: D32463753

Pulled By: vkuzo

fbshipit-source-id: 1d7de7d9c4837e2b0ec815f0f67014c7600bb16c
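For illustration only, here is a minimal sketch of the general pattern described above, i.e. resetting a per-forward op counter in place instead of rediscovering modules via a `named_modules()` walk on every call. The class and attribute names (`InstrumentedModule`, `_op_idx`, `reset_op_counter`) are hypothetical and are not the actual DBR quant internals.

```python
import torch
import torch.nn as nn


class InstrumentedModule(nn.Module):
    """Hypothetical wrapper that counts ops seen during one forward pass."""

    def __init__(self, mod: nn.Module):
        super().__init__()
        self.mod = mod
        self._op_idx = 0  # per-forward op counter kept on the module itself

    def reset_op_counter(self) -> None:
        # O(1) in-place reset; no traversal of the module tree required.
        self._op_idx = 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.mod(x)
        self._op_idx += 1
        return out


class InstrumentedModel(nn.Module):
    """Hypothetical top-level wrapper around a prepared model."""

    def __init__(self, model: nn.Module):
        super().__init__()
        self.model = model
        # Cache references to instrumented children once, at prepare time,
        # instead of calling self.named_modules() inside every forward.
        self._instrumented = [
            m for m in model.modules() if isinstance(m, InstrumentedModule)
        ]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for m in self._instrumented:
            m.reset_op_counter()  # cheap in-place reset, no tree walk
        return self.model(x)
```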
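The actual benchmark script is the gist linked in the test plan. A minimal sketch of that style of measurement is shown below; the `benchmark` helper, warmup count, and iteration count are assumptions, not the gist's exact code.

```python
import time

import torch
import torchvision


def benchmark(model: torch.nn.Module, x: torch.Tensor, n_iter: int = 50) -> float:
    """Return the average wall-clock seconds per forward pass."""
    torch.set_num_threads(1)  # single-threaded, matching OMP_NUM_THREADS=1
    model.eval()
    with torch.no_grad():
        for _ in range(5):  # warmup iterations, not timed
            model(x)
        start = time.perf_counter()
        for _ in range(n_iter):
            model(x)
    return (time.perf_counter() - start) / n_iter


if __name__ == "__main__":
    x = torch.randn(1, 3, 224, 224)  # small input so framework overhead dominates
    m = torchvision.models.mobilenet_v2()
    fp32_s = benchmark(m, x)
    print(f"fp32: {fp32_s:.6f} seconds avg")
    # The prepared and quantized variants (FX and DBR) would be timed the
    # same way and reported as a speedup relative to fp32_s.
```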