[quant][core][performance] Removed int_repr calls in quantized conv2d cudnn implementation (#73849)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73849
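This PR removes the int_repr() calls for the activation and weight tensors.
Rather than converting to an int8 tensor, we use the qint8 tensor directly:
fundamentally, the two tensors hold the same 8-bit data, except that the qint8
tensor also carries its quantization parameters. This avoids a copy of each
qint8 tensor. In the benchmarks below, aten::int_repr (previously 40 calls,
2.700ms CUDA total) no longer appears, and the average CUDA time of
quantized::conv2d drops from 307.100us to 173.000us.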
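The copy-versus-reuse distinction can be sketched in plain Python. This is only an analogy using the standard library, not PyTorch or cuDNN code: the `quantize` helper, the scale and zero point, and the sample values are all hypothetical. Copying the buffer stands in for int_repr(), and the memoryview stands in for using the qint8 data directly.

```python
from array import array


def quantize(values, scale, zero_point):
    """Affine-quantize floats to signed 8-bit integers (hypothetical helper)."""
    quantized = [
        max(-128, min(127, round(v / scale) + zero_point)) for v in values
    ]
    return array("b", quantized)  # "b" = signed char, i.e. int8 storage


# Stand-in for the data held by a qint8 tensor.
storage = quantize([-1.0, 0.0, 0.5, 1.0], scale=0.1, zero_point=0)

# Analogue of int_repr(): a second int8 buffer is allocated and filled.
copied = array("b", storage)

# Analogue of using the qint8 data directly: a view over the same bytes.
view = memoryview(storage)

# Mutating the original shows which of the two shares memory with it.
storage[0] = 7
print(copied[0], view[0])
```

The copy keeps the old value while the view sees the update, which is the sense in which the int8 representation and the quantized data are "the same bytes".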
Test Plan:
From the PyTorch root directory, run
```
python test/test_quantization.py TestQuantizedConv.test_qconv2d_cudnn
```
for accuracy testing and
```
python test/test_quantization.py TestQuantizedConv.test_benchmark
```
for benchmark testing.
Previous int8 benchmark (before this PR):
```
int8 benchmark result:
-----------------------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
Name                              Self CPU %       Self CPU    CPU total %      CPU total   CPU time avg      Self CUDA    Self CUDA %     CUDA total  CUDA time avg     # of Calls
-----------------------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
quantized::conv2d                     99.37%         2.408s         99.44%         2.410s      120.500ms        0.000us          0.00%        6.142ms      307.100us             20
cudaDeviceSynchronize                  0.48%       11.747ms          0.48%       11.747ms       11.747ms        0.000us          0.00%        0.000us        0.000us              1
ProfilerStep*                          0.07%        1.731ms         99.51%         2.412s      120.587ms        0.000us          0.00%        6.142ms      307.100us             20
aten::empty                            0.02%      501.000us          0.02%      501.000us        3.579us        0.000us          0.00%        0.000us        0.000us            140
cudaLaunchKernel                       0.02%      452.000us          0.02%      452.000us        7.533us        0.000us          0.00%        0.000us        0.000us             60
aten::int_repr                         0.01%      351.000us          0.04%      886.000us       22.150us        2.700ms         12.93%        2.700ms       67.500us             40
aten::_empty_affine_quantized          0.01%      172.000us          0.01%      172.000us        8.600us        0.000us          0.00%        0.000us        0.000us             20
aten::fill_                            0.01%      139.000us          0.01%      254.000us       12.700us        3.442ms         16.49%        3.442ms      172.100us             20
aten::q_scale                          0.00%       62.000us          0.00%       62.000us        1.550us        0.000us          0.00%        0.000us        0.000us             40
aten::zeros                            0.00%       61.000us          0.00%      112.000us        5.600us        0.000us          0.00%        0.000us        0.000us             20
-----------------------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
Self CPU time total: 2.424s
Self CUDA time total: 20.877ms
```
Current int8 benchmark (with this PR):
```
int8 benchmark result:
-----------------------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
Name                              Self CPU %       Self CPU    CPU total %      CPU total   CPU time avg      Self CUDA    Self CUDA %     CUDA total  CUDA time avg     # of Calls
-----------------------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
cudaDeviceSynchronize                 83.02%       15.241ms         83.02%       15.241ms       15.241ms        0.000us          0.00%        0.000us        0.000us              1
ProfilerStep*                          7.54%        1.384ms         16.48%        3.026ms      151.300us        0.000us          0.00%        3.460ms      173.000us             20
quantized::conv2d                      4.47%      821.000us          8.89%        1.632ms       81.600us        0.000us          0.00%        3.460ms      173.000us             20
aten::empty                            1.43%      262.000us          1.43%      262.000us        2.620us        0.000us          0.00%        0.000us        0.000us            100
cudaLaunchKernel                       1.05%      193.000us          1.05%      193.000us        9.650us        0.000us          0.00%        0.000us        0.000us             20
aten::fill_                            0.89%      164.000us          1.94%      357.000us       17.850us        3.460ms         19.64%        3.460ms      173.000us             20
aten::_empty_affine_quantized          0.86%      157.000us          0.86%      157.000us        7.850us        0.000us          0.00%        0.000us        0.000us             20
aten::q_scale                          0.32%       59.000us          0.32%       59.000us        1.475us        0.000us          0.00%        0.000us        0.000us             40
aten::zeros                            0.29%       53.000us          0.50%       92.000us        4.600us        0.000us          0.00%        0.000us        0.000us             20
cudaEventRecord                        0.11%       20.000us          0.11%       20.000us        1.000us        0.000us          0.00%        0.000us        0.000us             20
-----------------------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
Self CPU time total: 18.116ms
Self CUDA time total: 17.612ms
```
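As a rough consistency check on the two tables above, the per-call CUDA saving on quantized::conv2d can be compared with the CUDA time previously attributed to aten::int_repr. The sketch below only redoes arithmetic on figures copied from the tables. The variable names are illustrative, and it assumes the profiler's "CUDA total" for quantized::conv2d is the sum of its child ops, which the previous run's numbers are consistent with (2.700ms + 3.442ms = 6.142ms).

```python
# Figures copied from the profiler tables above, in microseconds.
int_repr_cuda_total_us = 2700.0    # aten::int_repr, CUDA total (previous run)
int_repr_calls = 40                # aten::int_repr, # of Calls (previous run)
conv_calls = 20                    # quantized::conv2d, # of Calls (both runs)
conv_cuda_avg_before_us = 307.1    # quantized::conv2d, CUDA time avg (previous)
conv_cuda_avg_after_us = 173.0     # quantized::conv2d, CUDA time avg (current)

# Two int_repr calls per conv2d call: one for the activation, one for the weight.
int_repr_per_conv = int_repr_calls / conv_calls

# CUDA time spent in int_repr per conv2d call in the previous run.
int_repr_us_per_conv = int_repr_cuda_total_us / conv_calls

# CUDA time saved per conv2d call between the two runs.
saved_us_per_conv = conv_cuda_avg_before_us - conv_cuda_avg_after_us

# Ratio of average CUDA time per conv2d call, before vs. after.
cuda_speedup = conv_cuda_avg_before_us / conv_cuda_avg_after_us

print(f"int_repr calls per conv2d: {int_repr_per_conv:.0f}")
print(f"int_repr CUDA time per conv2d: {int_repr_us_per_conv:.1f}us")
print(f"CUDA time saved per conv2d: {saved_us_per_conv:.1f}us")
print(f"CUDA speedup per conv2d: {cuda_speedup:.2f}x")
```

The saving per call (about 134us) is close to the 135us previously spent in the two int_repr copies, which suggests the removed copies account for essentially all of the CUDA-side improvement in this benchmark.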
Reviewed By: jerryzh168
Differential Revision: D34824248
Pulled By: dzdang
fbshipit-source-id: f1a558b50d1c9f8f30e1714d3a4667d929fc72ba
(cherry picked from commit e52ce623b3a56239d28de3d32df79c1491e717ff)