[pt][quant] Add vector path to copy kernel for quantized data types (#36189)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36189
Previously, the copy kernel for quantized data types had only a scalar path. This diff adds a vectorized path, which should speed up every op that uses copy. In one benchmarked model, quantized::mul_scalar becomes about 10x faster (20.076ms to 2.001ms total across 93 calls), and total self CPU time drops from 103.849ms to 85.500ms.
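The difference between a scalar path and a vector path can be sketched as follows. This is a hypothetical pure-Python illustration of the control flow only, not the ATen kernel: `copy_scalar`, `copy_vectorized`, and `VEC_WIDTH` are made-up names, and the width of 32 assumes 8-bit elements in a 256-bit register.

```python
# Hypothetical illustration only; the real kernel is C++ and uses SIMD registers.
VEC_WIDTH = 32  # assumed lane count: 32 one-byte elements per 256-bit register


def copy_scalar(src: bytes, dst: bytearray) -> None:
    """Scalar path: move one element per loop iteration."""
    for i in range(len(src)):
        dst[i] = src[i]


def copy_vectorized(src: bytes, dst: bytearray, width: int = VEC_WIDTH) -> None:
    """Vector path: move `width` elements per step, then finish with a scalar tail."""
    n = len(src)
    main_end = n - (n % width)  # largest multiple of `width` not exceeding n
    for i in range(0, main_end, width):
        # One slice assignment stands in for one vector load plus one vector store.
        dst[i:i + width] = src[i:i + width]
    for i in range(main_end, n):
        # Leftover elements that do not fill a whole vector.
        dst[i] = src[i]


# 805 elements: 25 full chunks of 32 plus a 5-element tail.
src = bytes(range(256)) * 3 + bytes(range(37))
dst_scalar = bytearray(len(src))
dst_vector = bytearray(len(src))
copy_scalar(src, dst_scalar)
copy_vectorized(src, dst_vector)
assert dst_scalar == dst_vector == bytearray(src)
```

Both functions produce the same output. The vector path simply needs far fewer loop iterations, which is where the speedup in copy-heavy ops such as mul_scalar would be expected to come from.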
### Before:
```
-------------------------  ----------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls
-------------------------  ----------------  ---------------  ---------------  ---------------  ---------------  ---------------
quantize_per_tensor        0.16%             171.287us        0.16%            171.287us        171.287us        1
quantized::conv2d          56.65%            58.830ms         56.65%           58.830ms         387.040us        152
quantized::add_scalar      6.02%             6.256ms          6.02%            6.256ms          67.270us         93
quantized::relu6           2.04%             2.121ms          2.04%            2.121ms          22.808us         93
quantized::mul_scalar      19.33%            20.076ms         19.33%           20.076ms         215.876us        93
quantized::mul             13.79%            14.320ms         13.79%           14.320ms         124.520us        115
quantized::add             1.17%             1.215ms          1.17%            1.215ms          43.388us         28
adaptive_avg_pool2d        0.04%             41.684us         0.64%            661.083us        28.743us         23
_adaptive_avg_pool2d       0.60%             619.399us        0.60%            619.399us        26.930us         23
sigmoid                    0.17%             180.745us        0.17%            180.745us        8.216us          22
dropout                    0.00%             1.798us          0.00%            1.798us          1.798us          1
view                       0.01%             8.529us          0.01%            8.529us          8.529us          1
dequantize                 0.01%             7.481us          0.01%            7.481us          7.481us          1
-------------------------  ----------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 103.849ms
```
### After:
```
-------------------------  ----------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                       Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls
-------------------------  ----------------  ---------------  ---------------  ---------------  ---------------  ---------------
quantize_per_tensor        0.23%             193.581us        0.23%            193.581us        193.581us        1
quantized::conv2d          68.66%            58.702ms         68.66%           58.702ms         386.197us        152
quantized::add_scalar      7.11%             6.082ms          7.11%            6.082ms          65.401us         93
quantized::relu6           2.40%             2.056ms          2.40%            2.056ms          22.104us         93
quantized::mul_scalar      2.34%             2.001ms          2.34%            2.001ms          21.513us         93
quantized::mul             16.85%            14.410ms         16.85%           14.410ms         125.308us        115
quantized::add             1.34%             1.149ms          1.34%            1.149ms          41.033us         28
adaptive_avg_pool2d        0.05%             46.415us         0.78%            667.620us        29.027us         23
_adaptive_avg_pool2d       0.73%             621.205us        0.73%            621.205us        27.009us         23
sigmoid                    0.25%             215.650us        0.25%            215.650us        9.802us          22
dropout                    0.00%             2.503us          0.00%            2.503us          2.503us          1
view                       0.01%             11.608us         0.01%            11.608us         11.608us         1
dequantize                 0.01%             9.221us          0.01%            9.221us          9.221us          1
-------------------------  ----------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 85.500ms
```
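The speedup figures can be checked directly from the two profiler tables. The snippet below is a small arithmetic sketch: the timings are copied from the tables above, while the dictionary layout and the `speedup` helper are illustrative names, not part of any profiler API.

```python
# Timings in milliseconds, copied from the Before/After profiler tables.
before_ms = {"quantized::mul_scalar": 20.076, "total": 103.849}
after_ms = {"quantized::mul_scalar": 2.001, "total": 85.500}


def speedup(name: str) -> float:
    """Ratio of the time before the change to the time after it."""
    return before_ms[name] / after_ms[name]


mul_scalar_speedup = speedup("quantized::mul_scalar")
total_speedup = speedup("total")

# Absolute savings, to see how much of the overall gain comes from mul_scalar.
mul_scalar_saved_ms = before_ms["quantized::mul_scalar"] - after_ms["quantized::mul_scalar"]
total_saved_ms = before_ms["total"] - after_ms["total"]

print(f"mul_scalar speedup: {mul_scalar_speedup:.2f}x")
print(f"end-to-end speedup: {total_speedup:.2f}x")
print(f"mul_scalar saved {mul_scalar_saved_ms:.3f}ms of {total_saved_ms:.3f}ms total")
```

By these numbers, mul_scalar accounts for roughly 18.1ms of the 18.3ms saved overall. The other ops move only slightly in either direction between the two runs.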
Test Plan: buck test //caffe2/test:quantization -- 'test_qtensor_copy' --print-passing-details
Reviewed By: jspark1105
Differential Revision: D20906956
fbshipit-source-id: d538b8dc0d031ce61cb1b0af14a1c012976d75b1