6c48a01c - [Quant] Improve performance of ONEDNN backend (#84470)

[Quant] Improve performance of ONEDNN backend (#84470)

## Description

This PR improves the performance of the ONEDNN quantization backend by adding fast paths for qconv, qconv_transpose, and qlinear. A cache stores reusable data on the first run, reducing runtime overhead on subsequent runs. Note: other quantization backends are not affected.

## Validation

**Correctness**: covered by unit tests.

**Performance**: time to run each op, in microseconds.

### Convolution, 1 core per instance, multiple instances on whole socket

| shape | onednn (old) | onednn (new) | Improvement |
| -- | -- | -- | -- |
| mb1_ic128oc128_id2od2kd3sd1dd0pd1_ih8oh8kh3sh1dh0ph1_iw10ow10kw3sw1dw0pw1 | 767.038 | 415.238 | 45.86% |
| mb1_ic256oc128_id4od4kd1sd1dd0pd0_ih16oh16kh1sh1dh0ph0_iw20ow20kw1sw1dw0pw0 | 194.979 | 167.353 | 14.17% |
| mb1_ic32oc16_ih112oh112kh1sh1dh0ph0_iw112ow112kw1sw1dw0pw0 | 104.024 | 78.206 | 24.82% |
| mb1_ic3oc16_ih224oh112kh3sh2dh0ph1_iw224ow112kw3sw2dw0pw1 | 104.178 | 81.559 | 21.71% |
| mb30_ic256oc256_ih14oh14kh3sh1dh0ph1_iw14ow14kw3sw1dw0pw1 | 12249.438 | 12079.125 | 1.39% |
| mb56_ic3oc28_ih24oh22kh3sh1dh0ph0_iw24ow22kw3sw1dw0pw0 | 438.046 | 405.21 | 7.50% |
| mb100_ic128oc128_ih16oh16kh3sh1dh0ph1_iw16ow16kw3sw1dw0pw1 | 13893.188 | 13797.609 | 0.69% |
| g2mb1_ic128oc256_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1 | 499.014 | 475.333 | 4.75% |
| g32mb1_ic1024oc1024_ih14oh14kh3sh1dh0ph1_iw14ow14kw3sw1dw0pw1 | 294.877 | 270.568 | 8.24% |
| g64mb1_ic1024oc2048_ih14oh7kh3sh2dh0ph1_iw14ow7kw3sw2dw0pw1 | 122.664 | 95.503 | 22.14% |
| g256mb1_ic256oc256_ih10oh5kh3sh2dh0ph1_iw10ow5kw3sw2dw0pw1 | 31.343 | 13.522 | 56.86% |
| g512mb1_ic512oc512_ih19oh10kh3sh2dh0ph1_iw19ow10kw3sw2dw0pw1 | 54.116 | 34.651 | 35.97% |
| g1024mb1_ic1024oc1024_ih10oh10kh3sh1dh0ph1_iw10ow10kw3sw1dw0pw1 | 74.989 | 55.566 | 25.90% |

### Convolution, 4 cores per instance, multiple instances on whole socket

| shape | onednn (old) | onednn (new) | Improvement |
| -- | -- | -- | -- |
| mb1_ic128oc128_id2od2kd3sd1dd0pd1_ih8oh8kh3sh1dh0ph1_iw10ow10kw3sw1dw0pw1 | 249.978 | 160.429 | 35.82% |
| mb1_ic256oc128_id4od4kd1sd1dd0pd0_ih16oh16kh1sh1dh0ph0_iw20ow20kw1sw1dw0pw0 | 102.726 | 89.555 | 12.82% |
| mb1_ic32oc16_ih112oh112kh1sh1dh0ph0_iw112ow112kw1sw1dw0pw0 | 72.993 | 57.622 | 21.06% |
| mb1_ic3oc16_ih224oh112kh3sh2dh0ph1_iw224ow112kw3sw2dw0pw1 | 76.607 | 61.847 | 19.27% |
| mb30_ic256oc256_ih14oh14kh3sh1dh0ph1_iw14ow14kw3sw1dw0pw1 | 3109.625 | 3006.062 | 3.33% |
| mb56_ic3oc28_ih24oh22kh3sh1dh0ph0_iw24ow22kw3sw1dw0pw0 | 191.194 | 175.997 | 7.95% |
| mb100_ic128oc128_ih16oh16kh3sh1dh0ph1_iw16ow16kw3sw1dw0pw1 | 3435.625 | 3391.438 | 1.29% |
| g2mb1_ic128oc256_ih28oh28kh3sh1dh0ph1_iw28ow28kw3sw1dw0pw1 | 205.209 | 191.931 | 6.47% |
| g32mb1_ic1024oc1024_ih14oh14kh3sh1dh0ph1_iw14ow14kw3sw1dw0pw1 | 157.004 | 142.82 | 9.03% |
| g64mb1_ic1024oc2048_ih14oh7kh3sh2dh0ph1_iw14ow7kw3sw2dw0pw1 | 83.262 | 71.689 | 13.90% |
| g256mb1_ic256oc256_ih10oh5kh3sh2dh0ph1_iw10ow5kw3sw2dw0pw1 | 31.848 | 13.378 | 57.99% |
| g512mb1_ic512oc512_ih19oh10kh3sh2dh0ph1_iw19ow10kw3sw2dw0pw1 | 50.216 | 32.663 | 34.95% |
| g1024mb1_ic1024oc1024_ih10oh10kh3sh1dh0ph1_iw10ow10kw3sw1dw0pw1 | 67.359 | 49.779 | 26.10% |

### Transposed Convolution, 1 core per instance, multiple instances on whole socket

| shape | onednn (old) | onednn (new) | Improvement |
| -- | -- | -- | -- |
| mb1_ic512oc256_ih4oh8kh4sh2dh0ph1_iw4ow8kw4sw2dw0pw1 | 537.12 | 505.142 | 5.95% |
| mb1_ic256oc128_ih8oh16kh4sh2dh0ph1_iw8ow16kw4sw2dw0pw1 | 296.95 | 275.724 | 7.15% |
| mb1_ic128oc64_ih16oh32kh4sh2dh0ph1_iw16ow32kw4sw2dw0pw1 | 266.933 | 251.175 | 5.90% |
| mb1_ic64oc3_ih32oh64kh4sh2dh0ph1_iw32ow64kw4sw2dw0pw1 | 141.77 | 126.41 | 10.83% |
| mb1_ic100oc512_ih1oh4kh4sh1dh0ph0_iw1ow4kw4sw1dw0pw0 | 89.511 | 66.719 | 25.46% |

### Transposed Convolution, 4 cores per instance, multiple instances on whole socket

| shape | onednn (old) | onednn (new) | Improvement |
| -- | -- | -- | -- |
| mb1_ic512oc256_ih4oh8kh4sh2dh0ph1_iw4ow8kw4sw2dw0pw1 | 181.594 | 163.77 | 9.82% |
| mb1_ic256oc128_ih8oh16kh4sh2dh0ph1_iw8ow16kw4sw2dw0pw1 | 163 | 145.104 | 10.98% |
| mb1_ic128oc64_ih16oh32kh4sh2dh0ph1_iw16ow32kw4sw2dw0pw1 | 163.158 | 150.71 | 7.63% |
| mb1_ic64oc3_ih32oh64kh4sh2dh0ph1_iw32ow64kw4sw2dw0pw1 | 109.955 | 98.603 | 10.32% |
| mb1_ic100oc512_ih1oh4kh4sh1dh0ph0_iw1ow4kw4sw1dw0pw0 | 69.502 | 54.523 | 21.55% |

### Linear, 1 core per instance, multiple instances on whole socket

| shape | onednn (old) | onednn (new) | Improvement |
| -- | -- | -- | -- |
| mb1ic16oc8 | 54.415 | 35.816 | 34.18% |
| mb1ic32oc16 | 26.764 | 16.041 | 40.07% |
| mb1ic64oc32 | 26.735 | 16.007 | 40.13% |
| mb1ic100oc1 | 26.712 | 16.06 | 39.88% |
| mb1ic512oc1000 | 65.261 | 51.947 | 20.40% |
| mb1ic1024oc1000 | 112.603 | 98.175 | 12.81% |
| mb1ic2048oc1000 | 207.294 | 192.014 | 7.37% |
| mb1ic4096oc4096 | 3761.094 | 3745.609 | 0.41% |
| mb1ic9216oc4096 | 8918.672 | 8912.547 | 0.07% |
| mb20ic2048oc91 | 52.487 | 44.623 | 14.98% |
| mb30ic512oc37 | 29.257 | 19.642 | 32.86% |
| mb100ic128oc256 | 39.32 | 29.81 | 24.19% |
| mb100ic256oc512 | 74.499 | 64.322 | 13.66% |
| mb100ic512oc1024 | 220.029 | 204.745 | 6.95% |
| mb100ic1024oc784 | 352.311 | 336.309 | 4.54% |

### Linear, 4 cores per instance, multiple instances on whole socket

| shape | onednn (old) | onednn (new) | Improvement |
| -- | -- | -- | -- |
| mb1ic16oc8 | 58.252 | 40.433 | 30.59% |
| mb1ic32oc16 | 23.901 | 15.549 | 34.94% |
| mb1ic64oc32 | 24.594 | 16.214 | 34.07% |
| mb1ic100oc1 | 24.011 | 15.4 | 35.86% |
| mb1ic512oc1000 | 49.781 | 41.988 | 15.65% |
| mb1ic1024oc1000 | 70.304 | 61.88 | 11.98% |
| mb1ic2048oc1000 | 92.259 | 85.715 | 7.09% |
| mb1ic4096oc4096 | 794.937 | 781.137 | 1.74% |
| mb1ic9216oc4096 | 2081.375 | 2067.75 | 0.65% |
| mb20ic2048oc91 | 66.929 | 58.338 | 12.84% |
| mb30ic512oc37 | 35.332 | 26.337 | 25.46% |
| mb100ic128oc256 | 42.21 | 38.908 | 7.82% |
| mb100ic256oc512 | 66.49 | 63.967 | 3.79% |
| mb100ic512oc1024 | 130.828 | 122.673 | 6.23% |
| mb100ic1024oc784 | 160.987 | 154.765 | 3.86% |

### Environment

- PyTorch version: 1.13.0a0+gitcdd625b
- Is debug build: False
- CUDA used to build PyTorch: None
- ROCM used to build PyTorch: N/A
- OS: Ubuntu 20.04.3 LTS (x86_64)
- GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
- Clang version: Could not collect
- CMake version: version 3.22.5
- Libc version: glibc-2.31
- Python version: 3.9.12 (main, Jun 1 2022, 11:38:51) [GCC 7.5.0] (64-bit runtime)
- Python platform: Linux-5.11.0-27-generic-x86_64-with-glibc2.31
- Is CUDA available: False
- CUDA runtime version: No CUDA
- GPU models and configuration: No CUDA
- Nvidia driver version: No CUDA
- cuDNN version: No CUDA
- HIP runtime version: N/A
- MIOpen runtime version: N/A
- Is XNNPACK available: True

Versions of relevant libraries:

- [pip3] intel-extension-for-pytorch==1.13.0+cpu
- [pip3] numpy==1.23.3
- [pip3] pytorch-widedeep==0.3.7
- [pip3] torch==1.13.0a0+git48b423b
- [pip3] torchvision==0.14.0a0+ebb68f3
- [conda] blas 1.0 mkl
- [conda] intel-extension-for-pytorch 1.13.0+cpu pypi_0 pypi
- [conda] mkl 2021.4.0 h06a4308_640
- [conda] mkl-include 2022.1.0 pypi_0 pypi
- [conda] mkl-service 2.4.0 py39h7f8727e_0
- [conda] mkl-static 2022.1.0 pypi_0 pypi
- [conda] mkl_fft 1.3.1 py39hd3c417c_0
- [conda] mkl_random 1.2.2 py39h51133e4_0
- [conda] numpy 1.23.3 pypi_0 pypi
- [conda] numpy-base 1.22.3 py39hf524024_0
- [conda] torch 1.13.0a0+git48b423b pypi_0 pypi
- [conda] torchvision 0.14.0a0+ebb68f3 pypi_0 pypi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84470
Approved by: https://github.com/jerryzh168
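The first-run caching pattern the description refers to can be sketched as a cache keyed on input shape: the expensive, shape-dependent preparation runs once per shape, and later calls with the same shape take the fast path. This is a minimal illustrative sketch; the names (`PrimitiveCache`, `_prepare`) are hypothetical and not the actual code added in this PR.

```python
# Minimal sketch of a first-run, shape-keyed cache. Names here are
# illustrative placeholders, not the actual PyTorch/oneDNN internals.

class PrimitiveCache:
    """Caches expensively prepared, shape-dependent data so that only
    the first run with a given input shape pays the preparation cost."""

    def __init__(self):
        self._cache = {}
        self.prepare_count = 0  # counts how many times the slow path ran

    def _prepare(self, shape):
        # Stand-in for the per-shape work a backend does on the first run
        # (e.g. building descriptors or reordering weights).
        self.prepare_count += 1
        return {"shape": shape, "ready": True}

    def get(self, shape):
        key = tuple(shape)
        if key not in self._cache:       # slow path: first run only
            self._cache[key] = self._prepare(key)
        return self._cache[key]          # fast path: reuse afterwards


cache = PrimitiveCache()
cache.get([1, 128, 8, 10])    # first run with this shape: prepares
cache.get([1, 128, 8, 10])    # repeat run: cache hit, no preparation
cache.get([1, 256, 16, 20])   # new shape: prepares again
assert cache.prepare_count == 2
```

The benchmark tables above are consistent with this design: the largest gains appear on small, short-running shapes, where the fixed per-call preparation overhead dominates the total op time.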