dbr quant overhead[1/x]: remove expensive calls to named_modules (#68309)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68309
This is the first of a series of PRs to reduce the overhead of the DBR quantization
prototype. For now, the measurement of this work is not super scientific, as
there is a lot of low-hanging fruit. As we speed up the prototype, we
may need to invest in better benchmarking.
Current benchmarking setup:
* macOS laptop with OMP_NUM_THREADS=1
* torchvision's mobilenet_v2
* input size 1x3x224x224
* we measure the fp32 forward, and the prepared and quantized forwards, for both FX quant and DBR quant (a minimal timing sketch follows below)
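A minimal sketch of this kind of timing harness, assuming torchvision is installed (the helper name `time_forward` is illustrative; the actual benchmark script is linked in the Test Plan):
```
import time
import torch
import torchvision

torch.set_num_threads(1)  # mirrors running with OMP_NUM_THREADS=1

model = torchvision.models.mobilenet_v2().eval()
inp = torch.randn(1, 3, 224, 224)

def time_forward(m, x, n_warmup=5, n_iter=50):
    # returns average wall-clock seconds per forward pass
    with torch.no_grad():
        for _ in range(n_warmup):
            m(x)
        start = time.time()
        for _ in range(n_iter):
            m(x)
    return (time.time() - start) / n_iter

fp32_s = time_forward(model, inp)
print(f"fp32: {fp32_s:.6f} seconds avg")
```
The prepared and quantized variants are timed the same way, and each is reported as a speedup relative to the fp32 baseline.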
Note that due to the small input size, this benchmark is pretty noisy.
The goal here is to measure the overhead of the DBR quant logic (not the kernels),
so a small input is good: we want the kernels to take as small a percentage of
overall time as possible.
The high-level goal is for the DBR quant convert forward to approach the FX time.
This first PR removes the expensive named_modules calls and instead resets the op
counter in the op itself. According to cProfile, this should be a 2 to 3 percent win.
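A hedged sketch of the before/after pattern (names like `_qstate` and `QuantStateSketch` are illustrative, not the actual DBR internals):
```
import torch

# Before (sketch): resetting per-module op counters walked the entire
# module tree via named_modules() on every top-level forward call.
def reset_op_counters_slow(model: torch.nn.Module):
    # named_modules() traverses every submodule each time it is called,
    # which is O(number of modules) of pure Python overhead per forward.
    for _name, mod in model.named_modules():
        if hasattr(mod, '_qstate'):  # '_qstate' is an illustrative name
            mod._qstate.op_idx = 0

# After (sketch): each quant state resets its own counter when it reaches
# the end of its recorded op sequence, so no tree traversal is needed.
class QuantStateSketch:
    def __init__(self, num_ops: int):
        self.num_ops = num_ops
        self.op_idx = 0

    def advance(self):
        self.op_idx += 1
        if self.op_idx == self.num_ops:
            # reset inline, in the op itself
            self.op_idx = 0
```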
Test Plan:
```
benchmark: https://gist.github.com/vkuzo/1a4f98ca541161704ee3c305d7740d4a
// before
fp32: 0.020101 seconds avg
fx_prepared: 0.020915 seconds avg, 0.961083 speedup vs fp32
fx_quantized: 0.012037 seconds avg, 1.670005 speedup vs fp32
dt_prepared: 0.037506 seconds avg, 0.535953 speedup vs fp32
dt_quantized: 0.022688 seconds avg, 0.885988 speedup vs fp32
// after
fp32: 0.020722 seconds avg
fx_prepared: 0.023417 seconds avg, 0.884893 speedup vs fp32
fx_quantized: 0.014834 seconds avg, 1.396942 speedup vs fp32
dt_prepared: 0.039120 seconds avg, 0.529700 speedup vs fp32
dt_quantized: 0.020063 seconds avg, 1.032831 speedup vs fp32
```
Reviewed By: albanD
Differential Revision: D32463753
Pulled By: vkuzo
fbshipit-source-id: 1d7de7d9c4837e2b0ec815f0f67014c7600bb16c