pytorch
1d43d7ca - Use `gpu_kernel` in Affine Quantizer (#37312)


4 years ago

Use `gpu_kernel` in Affine Quantizer (#37312)

Summary:
Removes `CUDA_tensor_apply2` from Affine Quantizer.

cc: zasdfgbnm

# Profiling

## This PR

### quint8

```
==4458== Range "quantize_per_tensor, seq = 0"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 4.8703ms 20 243.52us 207.60us 312.66us quantize_per_tensor, seq = 0
GPU activities: 100.00% 751.95us 10 75.194us 74.372us 79.044us _ZN2at6native6modern29vectorized_elementwise_kernelILi4EZZZNS0_75_GLOBAL__N__51_tmpxft_0000424b_00000000_6_affine_quantizer_cpp1_ii_92f2f7d738quantize_tensor_per_tensor_affine_cudaENS_6TensorES4_dlENKUlvE_clEvENKUlvE0_clEvEUlfN3c106quint8EE_NS_6detail5ArrayIPcLi3EEEEEviT0_T1_
API calls: 100.00% 162.48us 10 16.247us 13.383us 35.997us cudaLaunchKernel
```

### qint8

```
==14289== Range "quantize_per_tensor, seq = 0"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 4.8143ms 20 240.71us 155.68us 327.78us quantize_per_tensor, seq = 0
GPU activities: 100.00% 748.85us 10 74.884us 73.892us 78.565us _ZN2at6native6modern29vectorized_elementwise_kernelILi4EZZZNS0_75_GLOBAL__N__51_tmpxft_0000424b_00000000_6_affine_quantizer_cpp1_ii_92f2f7d738quantize_tensor_per_tensor_affine_cudaENS_6TensorES4_dlENKUlvE_clEvENKUlvE_clEvEUlfN3c105qint8EE_NS_6detail5ArrayIPcLi3EEEEEviT0_T1_
API calls: 100.00% 166.61us 10 16.661us 13.387us 39.237us cudaLaunchKernel
```

### qint32

```
==17303== Range "quantize_per_tensor, seq = 0"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 19.011ms 20 950.55us 308.07us 1.0331ms quantize_per_tensor, seq = 0
GPU activities: 100.00% 1.1440ms 10 114.40us 113.42us 117.74us _ZN2at6native6modern29vectorized_elementwise_kernelILi4EZZZNS0_75_GLOBAL__N__51_tmpxft_0000424b_00000000_6_affine_quantizer_cpp1_ii_92f2f7d738quantize_tensor_per_tensor_affine_cudaENS_6TensorES4_dlENKUlvE_clEvENKUlvE1_clEvEUlfN3c106qint32EE_NS_6detail5ArrayIPcLi3EEEEEviT0_T1_
API calls: 100.00% 163.78us 10 16.378us 13.747us 35.668us cudaLaunchKernel
```

## Original commit: b428f454e13f6e8055124ea19c32b554017137d0

### quint8

```
==4361== Range "quantize_per_tensor, seq = 0"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 5.6212ms 20 281.06us 230.17us 352.82us quantize_per_tensor, seq = 0
GPU activities: 100.00% 780.85us 10 78.084us 77.633us 78.561us _ZN2at4cuda75_GLOBAL__N__51_tmpxft_00007fda_00000000_6_affine_quantizer_cpp1_ii_13ee0d7721kernelPointwiseApply2IZZZNS_6native75_GLOBAL__N__51_tmpxft_00007fda_00000000_6_affine_quantizer_cpp1_ii_13ee0d7738quantize_tensor_per_tensor_affine_cudaENS_6TensorES5_dlENKUlvE_clEvENKUlvE0_clEvEUlRfRN3c106quint8EE_fSA_jLi1ELi1ELi1EEEvNS0_6detail10TensorInfoIT0_T2_EENSE_IT1_SG_EESG_T_
API calls: 100.00% 166.07us 10 16.606us 13.535us 36.578us cudaLaunchKernel
```

### qint8

```
==12583== Range "quantize_per_tensor, seq = 0"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 5.5765ms 20 278.82us 226.51us 351.23us quantize_per_tensor, seq = 0
GPU activities: 100.00% 783.28us 10 78.328us 77.826us 80.386us _ZN2at4cuda75_GLOBAL__N__51_tmpxft_00007fda_00000000_6_affine_quantizer_cpp1_ii_13ee0d7721kernelPointwiseApply2IZZZNS_6native75_GLOBAL__N__51_tmpxft_00007fda_00000000_6_affine_quantizer_cpp1_ii_13ee0d7738quantize_tensor_per_tensor_affine_cudaENS_6TensorES5_dlENKUlvE_clEvENKUlvE_clEvEUlRfRN3c105qint8EE_fSA_jLi1ELi1ELi1EEEvNS0_6detail10TensorInfoIT0_T2_EENSE_IT1_SG_EESG_T_
API calls: 100.00% 161.05us 10 16.104us 13.363us 34.284us cudaLaunchKernel
```

### qint32

```
==17267== Range "quantize_per_tensor, seq = 0"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 19.815ms 20 990.77us 381.03us 1.0717ms quantize_per_tensor, seq = 0
GPU activities: 100.00% 1.1778ms 10 117.78us 117.51us 118.44us _ZN2at4cuda75_GLOBAL__N__51_tmpxft_00007fda_00000000_6_affine_quantizer_cpp1_ii_13ee0d7721kernelPointwiseApply2IZZZNS_6native75_GLOBAL__N__51_tmpxft_00007fda_00000000_6_affine_quantizer_cpp1_ii_13ee0d7738quantize_tensor_per_tensor_affine_cudaENS_6TensorES5_dlENKUlvE_clEvENKUlvE1_clEvEUlRfRN3c106qint32EE_fSA_jLi1ELi1ELi1EEEvNS0_6detail10TensorInfoIT0_T2_EENSE_IT1_SG_EESG_T_
API calls: 100.00% 172.26us 10 17.226us 14.094us 37.952us cudaLaunchKernel
```

# Environment

```shell
Collecting environment information...
PyTorch version: 1.6.0a0+010771e
Is debug build: No
CUDA used to build PyTorch: 10.2

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: version 3.14.0

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.2.89
GPU models and configuration: GPU 0: TITAN V
Nvidia driver version: 440.33.01
cuDNN version: /usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.7

Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] torch==1.6.0a0+010771e
[conda] blas 1.0 mkl
[conda] magma-cuda102 2.5.2 1 pytorch
[conda] mkl 2020.0 166
[conda] mkl-include 2020.0 166
[conda] mkl-service 2.3.0 py37he904b0f_0
[conda] mkl_fft 1.0.15 py37ha843d7b_0
[conda] mkl_random 1.1.0 py37hd6b4f25_0
[conda] torch 1.6.0a0+010771e dev_0 <develop>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/37312

Differential Revision: D21383938

Pulled By: jerryzh168

fbshipit-source-id: 21539675267c64508a6b9eafcde1a8861d1fb421
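
For readers unfamiliar with the operation being profiled, the per-element computation behind `quantize_per_tensor` can be sketched in plain Python. This is an illustration only, not the code from this PR: the function name `quantize_affine` and its parameters are hypothetical, and it assumes the common affine scheme (divide by the scale, round to nearest with ties to even, add the zero point, clamp to the integer range). The actual kernel may differ in details, such as multiplying by a precomputed inverse scale.

```python
def quantize_affine(x, scale, zero_point, qmin, qmax):
    """Quantize one float with an assumed affine scheme (hypothetical helper)."""
    # Python's round() rounds ties to even, like the default nearbyint mode.
    q = round(x / scale) + zero_point
    # Clamp into the representable range of the target integer type.
    return max(qmin, min(qmax, q))


# quint8-style range [0, 255]; the scale and zero point are arbitrary examples.
values = [1.0, 0.25, 200.0, -50.0]
quantized = [quantize_affine(v, 0.5, 10, 0, 255) for v in values]
print(quantized)  # [12, 10, 255, 0]
```

Each output element depends only on the matching input element, which is why the operation can be written as an elementwise kernel.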

Author

crcrpar

Committer

facebook-github-bot

Parents

847d102e
