[cpu] explicitly vectorize digamma (#110217)
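For context: "implicitly vectorized" here means the scalar digamma loop is left to the compiler's auto-vectorizer, while "explicitly vectorized" means the kernel also supplies a hand-written SIMD path through `at::vec::Vectorized<scalar_t>`. The sketch below shows the usual shape of that pattern in ATen; it assumes the `cpu_kernel_vec` helper and a `digamma()` method on `Vectorized<scalar_t>`, and is illustrative rather than the exact diff in this PR.
```cpp
// Illustrative sketch of ATen's explicit-vectorization pattern -- not
// necessarily the exact code landed by this PR. Assumes cpu_kernel_vec
// (ATen/native/cpu/Loops.h), calc_digamma (ATen/native/Math.h), and a
// digamma() method on Vectorized<scalar_t>.
#include <ATen/Dispatch.h>
#include <ATen/native/TensorIterator.h>
#include <ATen/native/Math.h>
#include <ATen/native/cpu/Loops.h>
#include <ATen/cpu/vec/vec.h>

namespace at::native {

static void digamma_kernel_sketch(TensorIteratorBase& iter) {
  AT_DISPATCH_FLOATING_TYPES_AND2(
      kBFloat16, kHalf, iter.common_dtype(), "digamma_cpu", [&]() {
        cpu_kernel_vec(
            iter,
            // Scalar path: handles loop tails and non-SIMD fallback.
            [=](scalar_t a) -> scalar_t { return calc_digamma(a); },
            // Explicit SIMD path: one Vectorized<scalar_t> register per step.
            [=](vec::Vectorized<scalar_t> a) { return a.digamma(); });
      });
}

} // namespace at::native
```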
### Benchmarking results
```
[------------------------------ torch.digamma(x) Benchmark -----------------------------]
                                         |  implicitly vectorized  |  explicitly vectorized
1 threads: --------------------------------------------------------------------------------
      dtype torch.float16 - n : 100      |           3.8           |           3.5
      dtype torch.float16 - n : 200      |           5.8           |           5.3
      dtype torch.float16 - n : 500      |          11.8           |          10.7
      dtype torch.float16 - n : 1000     |          22.0           |          19.6
      dtype torch.float16 - n : 10000    |         203.6           |         179.7
      dtype torch.float32 - n : 100      |           3.8           |           3.6
      dtype torch.float32 - n : 200      |           5.7           |           5.5
      dtype torch.float32 - n : 500      |          11.1           |          11.1
      dtype torch.float32 - n : 1000     |          20.6           |          20.5
      dtype torch.float32 - n : 10000    |         191.7           |         189.6
      dtype torch.float64 - n : 100      |           3.8           |           3.7
      dtype torch.float64 - n : 200      |           5.9           |           5.7
      dtype torch.float64 - n : 500      |          11.9           |          11.7
      dtype torch.float64 - n : 1000     |          22.1           |          21.7
      dtype torch.float64 - n : 10000    |         203.6           |         199.7
      dtype torch.bfloat16 - n : 100     |           3.7           |           3.5
      dtype torch.bfloat16 - n : 200     |           5.6           |           5.3
      dtype torch.bfloat16 - n : 500     |          11.2           |          10.6
      dtype torch.bfloat16 - n : 1000    |          20.8           |          19.5
      dtype torch.bfloat16 - n : 10000   |         190.0           |         179.7

Times are in microseconds (us).
```
### Benchmarking config
Machine: Intel(R) Core(TM) i7-10870H CPU @ 2.20GHz
```python
>>> import torch
>>> print(f"Torch config: {torch.__config__.show()}")
Torch config: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- CPU capability usage: AVX2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CXX_COMPILER=/usr/local/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_PYTORCH_QNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.2.0, USE_CUDA=0, USE_CUDNN=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=0, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=OFF,
```
Benchmarking script:
```python
import pickle
from itertools import product

import torch
from torch.utils import benchmark

device = 'cpu'
dtypes = (torch.float16, torch.float32, torch.float64, torch.bfloat16)
n = (100, 200, 500, 1000, 10000)
result = []
for dtype, num in product(dtypes, n):
    x = torch.rand(num, dtype=dtype, device=device)
    torch.digamma(x)  # warm-up call before timing
    stmt = 'torch.digamma(x)'
    measurement = benchmark.Timer(
        stmt=stmt,
        globals={'x': x},
        label=stmt + " Benchmark",
        sub_label=f"dtype {dtype} - n : {num}",
        description="vectorized",
    ).blocked_autorange(min_run_time=10)
    result.append(measurement)
fname_prefix = "benchmark_digamma_"
benchmark.Compare(result).print()
# Persist the measurements so runs from different builds can be compared.
with open(fname_prefix + "vectorized", "wb") as f:
    pickle.dump(result, f)
```
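Note that, as written, the script labels every measurement `description="vectorized"` and pickles a single run; presumably the two-column table above was produced by running it twice (once on a build without this change and once with it), giving each run its own `description` and pickle suffix, and then feeding both unpickled result lists to `benchmark.Compare` together.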
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110217
Approved by: https://github.com/sanchitintel, https://github.com/vfdev-5, https://github.com/ezyang