pytorch
66f07c0c - Optimized bilinear interpolation using TensorIterator (#51653)

Optimized bilinear interpolation using TensorIterator (#51653)

Summary:
Related to https://github.com/pytorch/pytorch/issues/10482

Description:
- Optimized bilinear interpolation for 1d, 2d, 3d cases using TensorIterator

<details>
<summary> Interpolation 2d - 6 thread(s) </summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 320, 320] | [256, 256] | True | False | 0.3938 | 0.0782 | 5.0339
[1, 3, 320, 320] | [512, 512] | True | False | 1.5585 | 0.4105 | 3.7965
[1, 3, 320, 320] | [256, 256] | False | False | 0.3481 | 0.0760 | 4.5780
[1, 3, 320, 320] | [512, 512] | False | False | 1.5848 | 0.4091 | 3.8734
[1, 3, 320, 320] | [256, 256] | False | True | 1.2058 | 1.2034 | 1.0020
[1, 3, 320, 320] | [512, 512] | False | True | 4.8691 | 4.8537 | 1.0032
[32, 128, 64, 64] | [32, 32] | False | True | 6.3915 | 6.4041 | 0.9980
[32, 128, 64, 64] | [128, 128] | False | True | 166.1769 | 164.5621 | 1.0098
[32, 128, 64, 64] | [32, 32] | True | False | 3.7194 | 2.4720 | 1.5046
[32, 128, 64, 64] | [128, 128] | True | False | 86.6704 | 52.3754 | 1.6548
[1, 3, 500, 500] | [256, 256] | True | False | 0.3270 | 0.0792 | 4.1307
[1, 3, 500, 500] | [800, 800] | True | False | 3.3116 | 0.5567 | 5.9482
[1, 3, 500, 500] | [256, 256] | False | False | 0.3763 | 0.0773 | 4.8700
[1, 3, 500, 500] | [800, 800] | False | False | 3.2577 | 0.5590 | 5.8279

</details>

<details>
<summary> Interpolation 1d - 6 thread(s) </summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[4, 512, 320] | 256 | True | False | 0.2795 | 0.1032 | 2.7089
[4, 512, 320] | 512 | True | False | 0.5533 | 0.1888 | 2.9303

</details>

<details>
<summary> Interpolation 3d - 6 thread(s) </summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 16, 320, 320] | [8, 256, 256] | True | False | 4.4105 | 2.1236 | 2.0769
[1, 3, 16, 320, 320] | [32, 512, 512] | True | False | 83.9426 | 42.6641 | 1.9675
[1, 3, 16, 320, 320] | [8, 256, 256] | False | True | 15.5736 | 15.5758 | 0.9999
[1, 3, 16, 320, 320] | [32, 512, 512] | False | True | 272.4795 | 273.2745 | 0.9971

</details>

<details>
<summary> Interpolation 2d - 1 thread(s) </summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 320, 320] | [256, 256] | True | False | 1.0240 | 0.4145 | 2.4705
[1, 3, 320, 320] | [512, 512] | True | False | 4.0771 | 1.3836 | 2.9467
[1, 3, 320, 320] | [256, 256] | False | False | 0.9771 | 0.3270 | 2.9878
[1, 3, 320, 320] | [512, 512] | False | False | 4.1732 | 1.2209 | 3.4180
[1, 3, 320, 320] | [256, 256] | False | True | 1.5466 | 1.5363 | 1.0067
[1, 3, 320, 320] | [512, 512] | False | True | 6.1555 | 6.1199 | 1.0058
[32, 128, 64, 64] | [32, 32] | False | True | 27.6362 | 27.5901 | 1.0017
[32, 128, 64, 64] | [128, 128] | False | True | 468.6442 | 465.5163 | 1.0067
[32, 128, 64, 64] | [32, 32] | True | False | 20.1495 | 10.0694 | 2.0011
[32, 128, 64, 64] | [128, 128] | True | False | 400.0401 | 204.0662 | 1.9603
[1, 3, 500, 500] | [256, 256] | True | False | 0.8956 | 0.3366 | 2.6606
[1, 3, 500, 500] | [800, 800] | True | False | 8.6554 | 2.9530 | 2.9310
[1, 3, 500, 500] | [256, 256] | False | False | 1.0921 | 0.3385 | 3.2263
[1, 3, 500, 500] | [800, 800] | False | False | 8.9594 | 2.9627 | 3.0241

</details>

<details>
<summary> Interpolation 1d - 1 thread(s) </summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[4, 512, 320] | 256 | True | False | 1.5233 | 0.5027 | 3.0301
[4, 512, 320] | 512 | True | False | 3.0302 | 0.9735 | 3.1128

</details>

<details>
<summary> Interpolation 3d - 1 thread(s) </summary>

In | Out | Is contiguous | Channels last | master | this PR | speed-up
---|---|---|---|---|---|---
[1, 3, 16, 320, 320] | [8, 256, 256] | True | False | 12.0477 | 11.3196 | 1.0643
[1, 3, 16, 320, 320] | [32, 512, 512] | True | False | 222.8618 | 209.9955 | 1.0613
[1, 3, 16, 320, 320] | [8, 256, 256] | False | True | 17.9883 | 17.9937 | 0.9997
[1, 3, 16, 320, 320] | [32, 512, 512] | False | True | 380.7244 | 380.1916 | 1.0014

</details>

<details>
<summary> Versions and build configs </summary>

PyTorch master: 1.9.0.dev20210223

PyTorch master build setting:
```
BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
```

PR: 1.9.0a0+74b172b

PR build setting:
```
BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/g++-7, CXX_FLAGS=-O3 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON,
```

</details>

This description is based on the benchmarks and the code from [here](https://github.com/vfdev-5/interpolate-tensoriterator/tree/master/step_six).

TL;DR
- Generic linear upsampling implementation using TensorIterator for the Nd case (a single loop function covers the 1d, 2d and 3d cases); it can be generalized to the nearest and bicubic interpolation modes.
- Works for both channels-first and channels-last cases.

Joint work with Francisco Massa (fmassa).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51653

Reviewed By: malfet

Differential Revision: D26619437

Pulled By: fmassa

fbshipit-source-id: 7d435e23881c5b40a18bf0dbcab4906d5462025f
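The benchmark tables above exercise the public `torch.nn.functional.interpolate` entry point for the linear family of modes (linear/bilinear/trilinear) over the same input shapes, memory layouts, and output sizes. The sketch below is not the PR's benchmark harness (that lives in the linked `interpolate-tensoriterator` repository); it is a minimal, hypothetical reproduction of a few table rows, with the `bench` helper and iteration count chosen for illustration.

```python
import time

import torch
import torch.nn.functional as F


def bench(x, size, mode, iters=10):
    # Warm up once, then report the mean wall-clock time per call in ms.
    # (A hypothetical helper; the PR's own harness is in the linked repo.)
    F.interpolate(x, size=size, mode=mode, align_corners=False)
    start = time.perf_counter()
    for _ in range(iters):
        F.interpolate(x, size=size, mode=mode, align_corners=False)
    return (time.perf_counter() - start) / iters * 1e3


# Inputs matching rows from the tables above.
x2d = torch.randn(1, 3, 320, 320)                            # contiguous NCHW
x2d_cl = x2d.contiguous(memory_format=torch.channels_last)   # channels last
x1d = torch.randn(4, 512, 320)
x3d = torch.randn(1, 3, 16, 320, 320)

print(f"2d contiguous   : {bench(x2d,    (256, 256),    'bilinear'):8.4f} ms")
print(f"2d channels last: {bench(x2d_cl, (256, 256),    'bilinear'):8.4f} ms")
print(f"1d              : {bench(x1d,    256,           'linear'):8.4f} ms")
print(f"3d              : {bench(x3d,    (8, 256, 256), 'trilinear'):8.4f} ms")
```

Thread count matters for comparing against the 6-thread vs. 1-thread tables; it can be pinned with `torch.set_num_threads(n)` before timing.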