[FX] Prototype Conv/BN fuser in FX (#47657)
Summary:
Some interesting stuff going on. All benchmarks are run with both my implementation and the current quantized fuser.
For these benchmarks, things like whether MKLDNN/FBGEMM are enabled make a big difference.
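Roughly, the per-model numbers come from a simple timing loop along these lines (the batch size, iteration count, and `bench` helper here are illustrative, not the exact script used):
```
import time
import torch
import torchvision.models as models

def bench(model, inp, iters=10):
    # Warm up once, then time `iters` forward passes without autograd.
    with torch.no_grad():
        model(inp)
        start = time.time()
        for _ in range(iters):
            model(inp)
        return time.time() - start

model = models.resnet18().eval()
inp = torch.randn(1, 3, 224, 224)
print("non-fused:", bench(model, inp))
```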
## Manual compilation (everything turned off)
In the small (toy) case, things look good:
```
non-fused: 1.174886703491211
fused: 0.7494957447052002
```
However, for `torchvision.resnet18`, we see:
```
non-fused: 1.2272708415985107
fused: 3.7183213233947754
```
This is because, without any acceleration libraries, Conv (no bias) -> BatchNorm is actually faster than a single Conv (with bias): fusing folds the BatchNorm into the conv and forces the conv to carry a bias term.
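For context, a minimal sketch of that folding for an eval-mode, affine BatchNorm2d (the `fuse_conv_bn` helper is illustrative, not the actual FX pass):
```
import copy
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # Fold an eval-mode BatchNorm2d into the preceding Conv2d.
    fused = copy.deepcopy(conv)
    w = conv.weight
    b = conv.bias if conv.bias is not None else torch.zeros(
        conv.out_channels, device=w.device, dtype=w.dtype)
    # Per-output-channel scale derived from the BN running statistics.
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight = nn.Parameter(w * scale.reshape(-1, 1, 1, 1))
    # The fused conv always ends up with a bias, even if the original conv
    # had none -- which is why the fused op can lose to Conv (no bias) -> BN
    # when no fast conv backend is available.
    fused.bias = nn.Parameter((b - bn.running_mean) * scale + bn.bias)
    return fused
```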
## Nightly (CPU)
```
Toy
non-fused: 0.45807552337646484
fused: 0.34779977798461914
resnet18
non-fused: 0.14216232299804688
fused: 0.13438796997070312
resnet50
non-fused: 0.2999534606933594
fused: 0.29364800453186035
densenet161
non-fused: 0.6558926105499268
fused: 0.6190280914306641
inception_v3
non-fused: 1.2804391384124756
fused: 1.181272029876709
```
These numbers are with MKLDNN enabled.
We see a small performance gain across the board, with more significant gains for the smaller models.
## Nightly (CUDA)
```
M
non-fused: 1.2220964431762695
fused: 1.0833759307861328
resnet18
non-fused: 0.09721899032592773
fused: 0.09089207649230957
resnet50
non-fused: 0.2053072452545166
fused: 0.19138741493225098
densenet161
non-fused: 0.6830024719238281
fused: 0.660109281539917
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47657
Reviewed By: eellison
Differential Revision: D25127546
Pulled By: Chillee
fbshipit-source-id: ecdf682038def046045fcc09faf9aeb6c459b5e3