[pt][quant] Optimized qadd_scalar (#34925)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34925
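Optimized path for qadd scalar (quantized::add_scalar). On the profiled model, total quantized::add_scalar time goes down from 55.840ms to 4.637ms over 69 calls (about 12x), and total self CPU time goes down from 125.394ms to 73.930ms.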
### Before
```
-------------------------  ----------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls
-------------------------  ----------------  ---------------  ---------------  ---------------  ---------------  ---------------
quantize_per_tensor        0.12%             155.807us        0.12%            155.807us        155.807us        1
quantized::conv2d          25.50%            31.981ms         25.50%           31.981ms         273.343us        117
quantized::add_scalar      44.53%            55.840ms         44.53%           55.840ms         809.281us        69
quantized::relu6           1.25%             1.570ms          1.25%            1.570ms          22.749us         69
quantized::mul_scalar      10.73%            13.449ms         10.73%           13.449ms         194.914us        69
quantized::mul             16.67%            20.904ms         16.67%           20.904ms         227.220us        92
adaptive_avg_pool2d        0.03%             41.713us         0.69%            862.922us        35.955us         24
_adaptive_avg_pool2d       0.65%             821.209us        0.65%            821.209us        34.217us         24
sigmoid                    0.15%             182.344us        0.15%            182.344us        7.928us          23
quantized::add             0.34%             431.939us        0.34%            431.939us        26.996us         16
dropout                    0.00%             1.936us          0.00%            1.936us          1.936us          1
view                       0.01%             10.281us         0.01%            10.281us         10.281us         1
dequantize                 0.00%             4.562us          0.00%            4.562us          4.562us          1
-------------------------  ----------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 125.394ms
```
### After
```
-------------------------  ----------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls
-------------------------  ----------------  ---------------  ---------------  ---------------  ---------------  ---------------
quantize_per_tensor        0.18%             130.534us        0.18%            130.534us        130.534us        1
quantized::conv2d          42.29%            31.267ms         42.29%           31.267ms         267.243us        117
quantized::add_scalar      6.27%             4.637ms          6.27%            4.637ms          67.205us         69
quantized::relu6           1.77%             1.312ms          1.77%            1.312ms          19.008us         69
quantized::mul_scalar      18.92%            13.991ms         18.92%           13.991ms         202.768us        69
quantized::mul             28.49%            21.059ms         28.49%           21.059ms         228.904us        92
adaptive_avg_pool2d        0.06%             45.242us         1.27%            942.522us        39.272us         24
_adaptive_avg_pool2d       1.21%             897.280us        1.21%            897.280us        37.387us         24
sigmoid                    0.22%             160.282us        0.22%            160.282us        6.969us          23
quantized::add             0.56%             416.276us        0.56%            416.276us        26.017us         16
dropout                    0.00%             1.245us          0.00%            1.245us          1.245us          1
view                       0.01%             7.122us          0.01%            7.122us          7.122us          1
dequantize                 0.01%             5.952us          0.01%            5.952us          5.952us          1
-------------------------  ----------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 73.930ms
```
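### Speedup check
The figures in the two tables can be cross-checked with a few lines of arithmetic. This is only a sketch: the variable and function names are hypothetical, and the inputs are the quantized::add_scalar and "Self CPU time total" values copied from the Before and After profiles.

```python
# Values copied from the Before/After profiler tables, in milliseconds.
# All names here are illustrative and not part of the PR.
add_scalar_before_ms = 55.840
add_scalar_after_ms = 4.637
total_before_ms = 125.394
total_after_ms = 73.930


def speedup(before_ms, after_ms):
    """Return how many times faster the 'after' measurement is."""
    return before_ms / after_ms


add_scalar_speedup = speedup(add_scalar_before_ms, add_scalar_after_ms)
total_speedup = speedup(total_before_ms, total_after_ms)

# Time saved in the op itself versus time saved across the whole model.
add_scalar_saved_ms = add_scalar_before_ms - add_scalar_after_ms
total_saved_ms = total_before_ms - total_after_ms
share = add_scalar_saved_ms / total_saved_ms

print(f"quantized::add_scalar speedup: {add_scalar_speedup:.2f}x")
print(f"end-to-end speedup:            {total_speedup:.2f}x")
print(f"share of total savings from add_scalar: {share:.1%}")
```

The op itself is roughly 12x faster and the whole model roughly 1.7x faster, with nearly all of the end-to-end savings coming from quantized::add_scalar. The other ops differ only by small amounts between the two runs.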
ghstack-source-id: 100595212
Test Plan: buck test //caffe2/test:quantized -- 'test_qadd' --print-passing-details
Differential Revision: D20500848
fbshipit-source-id: c292d15da121e6d13cc4eb92f10549874ff6ab0f