Use `gpu_kernel` in Affine Quantizer (#37312)
Summary:
Replaces `CUDA_tensor_apply2` with `gpu_kernel` in the Affine Quantizer.
cc: zasdfgbnm
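For readers unfamiliar with the operation being ported, here is a minimal sketch of the per-element arithmetic that an affine quantizer applies. It assumes the standard affine quantization formula (round, shift by the zero point, clamp to the target range). The function name `quantize_affine` and its parameters are illustrative and not taken from this PR, and the actual kernel lambda may differ in detail.

```python
def quantize_affine(x, scale, zero_point, qmin, qmax):
    """Illustrative per-element affine quantization (assumed standard formula)."""
    # Scale the float value, round to the nearest integer, shift by the zero point.
    q = round(x / scale) + zero_point
    # Clamp into the representable range of the quantized type.
    return max(qmin, min(qmax, q))


# Example ranges: quint8 is [0, 255], qint8 is [-128, 127].
values = [-100.0, 1.0, 1000.0]
as_quint8 = [quantize_affine(v, 0.5, 10, 0, 255) for v in values]
as_qint8 = [quantize_affine(v, 0.5, 10, -128, 127) for v in values]
```

An element-wise kernel helper such as `gpu_kernel` only needs a function of this shape, applied independently to each element, which is why the port does not change the computed values.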
# Profiling
## This PR
### quint8
```
==4458== Range "quantize_per_tensor, seq = 0"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 4.8703ms 20 243.52us 207.60us 312.66us quantize_per_tensor, seq = 0
GPU activities: 100.00% 751.95us 10 75.194us 74.372us 79.044us _ZN2at6native6modern29vectorized_elementwise_kernelILi4EZZZNS0_75_GLOBAL__N__51_tmpxft_0000424b_00000000_6_affine_quantizer_cpp1_ii_92f2f7d738quantize_tensor_per_tensor_affine_cudaENS_6TensorES4_dlENKUlvE_clEvENKUlvE0_clEvEUlfN3c106quint8EE_NS_6detail5ArrayIPcLi3EEEEEviT0_T1_
API calls: 100.00% 162.48us 10 16.247us 13.383us 35.997us cudaLaunchKernel
```
### qint8
```
==14289== Range "quantize_per_tensor, seq = 0"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 4.8143ms 20 240.71us 155.68us 327.78us quantize_per_tensor, seq = 0
GPU activities: 100.00% 748.85us 10 74.884us 73.892us 78.565us _ZN2at6native6modern29vectorized_elementwise_kernelILi4EZZZNS0_75_GLOBAL__N__51_tmpxft_0000424b_00000000_6_affine_quantizer_cpp1_ii_92f2f7d738quantize_tensor_per_tensor_affine_cudaENS_6TensorES4_dlENKUlvE_clEvENKUlvE_clEvEUlfN3c105qint8EE_NS_6detail5ArrayIPcLi3EEEEEviT0_T1_
API calls: 100.00% 166.61us 10 16.661us 13.387us 39.237us cudaLaunchKernel
```
### qint32
```
==17303== Range "quantize_per_tensor, seq = 0"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 19.011ms 20 950.55us 308.07us 1.0331ms quantize_per_tensor, seq = 0
GPU activities: 100.00% 1.1440ms 10 114.40us 113.42us 117.74us _ZN2at6native6modern29vectorized_elementwise_kernelILi4EZZZNS0_75_GLOBAL__N__51_tmpxft_0000424b_00000000_6_affine_quantizer_cpp1_ii_92f2f7d738quantize_tensor_per_tensor_affine_cudaENS_6TensorES4_dlENKUlvE_clEvENKUlvE1_clEvEUlfN3c106qint32EE_NS_6detail5ArrayIPcLi3EEEEEviT0_T1_
API calls: 100.00% 163.78us 10 16.378us 13.747us 35.668us cudaLaunchKernel
```
## Original
commit: b428f454e13f6e8055124ea19c32b554017137d0
### quint8
```
==4361== Range "quantize_per_tensor, seq = 0"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 5.6212ms 20 281.06us 230.17us 352.82us quantize_per_tensor, seq = 0
GPU activities: 100.00% 780.85us 10 78.084us 77.633us 78.561us _ZN2at4cuda75_GLOBAL__N__51_tmpxft_00007fda_00000000_6_affine_quantizer_cpp1_ii_13ee0d7721kernelPointwiseApply2IZZZNS_6native75_GLOBAL__N__51_tmpxft_00007fda_00000000_6_affine_quantizer_cpp1_ii_13ee0d7738quantize_tensor_per_tensor_affine_cudaENS_6TensorES5_dlENKUlvE_clEvENKUlvE0_clEvEUlRfRN3c106quint8EE_fSA_jLi1ELi1ELi1EEEvNS0_6detail10TensorInfoIT0_T2_EENSE_IT1_SG_EESG_T_
API calls: 100.00% 166.07us 10 16.606us 13.535us 36.578us cudaLaunchKernel
```
### qint8
```
==12583== Range "quantize_per_tensor, seq = 0"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 5.5765ms 20 278.82us 226.51us 351.23us quantize_per_tensor, seq = 0
GPU activities: 100.00% 783.28us 10 78.328us 77.826us 80.386us _ZN2at4cuda75_GLOBAL__N__51_tmpxft_00007fda_00000000_6_affine_quantizer_cpp1_ii_13ee0d7721kernelPointwiseApply2IZZZNS_6native75_GLOBAL__N__51_tmpxft_00007fda_00000000_6_affine_quantizer_cpp1_ii_13ee0d7738quantize_tensor_per_tensor_affine_cudaENS_6TensorES5_dlENKUlvE_clEvENKUlvE_clEvEUlRfRN3c105qint8EE_fSA_jLi1ELi1ELi1EEEvNS0_6detail10TensorInfoIT0_T2_EENSE_IT1_SG_EESG_T_
API calls: 100.00% 161.05us 10 16.104us 13.363us 34.284us cudaLaunchKernel
```
### qint32
```
==17267== Range "quantize_per_tensor, seq = 0"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 19.815ms 20 990.77us 381.03us 1.0717ms quantize_per_tensor, seq = 0
GPU activities: 100.00% 1.1778ms 10 117.78us 117.51us 118.44us _ZN2at4cuda75_GLOBAL__N__51_tmpxft_00007fda_00000000_6_affine_quantizer_cpp1_ii_13ee0d7721kernelPointwiseApply2IZZZNS_6native75_GLOBAL__N__51_tmpxft_00007fda_00000000_6_affine_quantizer_cpp1_ii_13ee0d7738quantize_tensor_per_tensor_affine_cudaENS_6TensorES5_dlENKUlvE_clEvENKUlvE1_clEvEUlRfRN3c106qint32EE_fSA_jLi1ELi1ELi1EEEvNS0_6detail10TensorInfoIT0_T2_EENSE_IT1_SG_EESG_T_
API calls: 100.00% 172.26us 10 17.226us 14.094us 37.952us cudaLaunchKernel
```
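To compare the two sets of numbers at a glance, the short script below computes the relative reduction in average GPU kernel time per dtype. The averages are copied from the "GPU activities" rows above; the variable names are illustrative. These are single profiling runs of 10 kernel launches each, so the result is only a rough indication: a reduction of a few percent for each dtype.

```python
# Average GPU kernel time in microseconds ("Avg" column, "GPU activities" rows).
original_us = {"quint8": 78.084, "qint8": 78.328, "qint32": 117.78}
this_pr_us = {"quint8": 75.194, "qint8": 74.884, "qint32": 114.40}

# Relative reduction: (old - new) / old.
reductions = {
    dtype: (original_us[dtype] - this_pr_us[dtype]) / original_us[dtype]
    for dtype in original_us
}

for dtype, r in reductions.items():
    print(f"{dtype}: {r:.2%} lower average kernel time")
```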
# Environment
```shell
Collecting environment information...
PyTorch version: 1.6.0a0+010771e
Is debug build: No
CUDA used to build PyTorch: 10.2
OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: version 3.14.0
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.2.89
GPU models and configuration: GPU 0: TITAN V
Nvidia driver version: 440.33.01
cuDNN version: /usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.7
Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] torch==1.6.0a0+010771e
[conda] blas 1.0 mkl
[conda] magma-cuda102 2.5.2 1 pytorch
[conda] mkl 2020.0 166
[conda] mkl-include 2020.0 166
[conda] mkl-service 2.3.0 py37he904b0f_0
[conda] mkl_fft 1.0.15 py37ha843d7b_0
[conda] mkl_random 1.1.0 py37hd6b4f25_0
[conda] torch 1.6.0a0+010771e dev_0 <develop>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37312
Differential Revision: D21383938
Pulled By: jerryzh168
fbshipit-source-id: 21539675267c64508a6b9eafcde1a8861d1fb421