495f77be - [cpu] explicitly vectorize digamma (#110217)

[cpu] explicitly vectorize digamma (#110217)

### Benchmarking results

```
[------------------------ torch.digamma(x) Benchmark ------------------------]
                                        |  implicitly vectorized  |  explicitly vectorized
1 threads: -------------------------------------------------------------------------------
      dtype torch.float16  - n : 100    |           3.8           |           3.5
      dtype torch.float16  - n : 200    |           5.8           |           5.3
      dtype torch.float16  - n : 500    |          11.8           |          10.7
      dtype torch.float16  - n : 1000   |          22.0           |          19.6
      dtype torch.float16  - n : 10000  |         203.6           |         179.7
      dtype torch.float32  - n : 100    |           3.8           |           3.6
      dtype torch.float32  - n : 200    |           5.7           |           5.5
      dtype torch.float32  - n : 500    |          11.1           |          11.1
      dtype torch.float32  - n : 1000   |          20.6           |          20.5
      dtype torch.float32  - n : 10000  |         191.7           |         189.6
      dtype torch.float64  - n : 100    |           3.8           |           3.7
      dtype torch.float64  - n : 200    |           5.9           |           5.7
      dtype torch.float64  - n : 500    |          11.9           |          11.7
      dtype torch.float64  - n : 1000   |          22.1           |          21.7
      dtype torch.float64  - n : 10000  |         203.6           |         199.7
      dtype torch.bfloat16 - n : 100    |           3.7           |           3.5
      dtype torch.bfloat16 - n : 200    |           5.6           |           5.3
      dtype torch.bfloat16 - n : 500    |          11.2           |          10.6
      dtype torch.bfloat16 - n : 1000   |          20.8           |          19.5
      dtype torch.bfloat16 - n : 10000  |         190.0           |         179.7

Times are in microseconds (us).
```

### Benchmarking config

Machine: Intel(R) Core(TM) i7-10870H CPU @ 2.20GHz

```python
>>> import torch
>>> print(f"Torch config: {torch.__config__.show()}")
Torch config: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - CPU capability usage: AVX2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CXX_COMPILER=/usr/local/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_PYTORCH_QNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.2.0, USE_CUDA=0, USE_CUDNN=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=0, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=OFF,
```

### Script

```python
import torch
import pickle
from torch.utils import benchmark
from itertools import product

device = 'cpu'
dtypes = (torch.float16, torch.float32, torch.float64, torch.bfloat16)
n = (100, 200, 500, 1000, 10000)
result = []

for dtype, num in product(dtypes, n):
    # Create the input tensor and warm up the op once before timing it
    x = torch.rand(num, dtype=dtype, device=device)
    torch.digamma(x)

    stmt = 'torch.digamma(x)'
    measurement = benchmark.Timer(
        stmt=stmt,
        globals={'x': x},
        label=stmt + " Benchmark",
        sub_label=f"dtype {dtype} - n : {num}",
        description="vectorized",
    ).blocked_autorange(min_run_time=10)
    result.append(measurement)

fname_prefix = "benchmark_digamma_"
benchmark.Compare(result).print()

# Persist the raw measurements so runs from different builds can be compared later
with open(fname_prefix + "vectorized", "wb") as f:
    pickle.dump(result, f)
```
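The script above records measurements for a single build, so the two-column table at the top presumably comes from running it once per build (implicit vs. explicit vectorization) and merging the pickled results. A minimal sketch of that merge step, assuming the pickle files were renamed per build, is below; the file names are assumptions, not part of the original script.

```python
# Hypothetical merge step (not part of the original PR): load the pickled
# measurements from the two builds and print one combined comparison table.
# The file names below are assumed; the script above writes
# "benchmark_digamma_vectorized" for whichever build it was run against.
import pickle
from torch.utils import benchmark

results = []
for fname in ("benchmark_digamma_implicitly_vectorized",
              "benchmark_digamma_explicitly_vectorized"):
    with open(fname, "rb") as f:
        results.extend(pickle.load(f))

benchmark.Compare(results).print()
```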
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110217
Approved by: https://github.com/sanchitintel, https://github.com/vfdev-5, https://github.com/ezyang