Add mkldnn tanh operator (#54656)
Summary:
## :rocket: Feature
Add an Mkl-Layout kernel for tanh.
## Motivation
We want to add an Mkl-Layout kernel for tanh to improve tanh's performance when the input tensor has Mkl layout.
Because PyTorch does not have an Mkl-Layout kernel for tanh, tanh cannot be executed directly on an Mkl-Layout tensor.
Of course, the problem can be temporarily avoided by converting with to_dense/to_mkldnn, but the copy overhead reduces performance significantly (1.6-4.3 times slower than the CPU kernel).
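The round-trip workaround looks like the following (a minimal sketch; it assumes a PyTorch build with MKL-DNN enabled, and the shape is arbitrary):

``` python
import torch

x = torch.randn(4, 4)
if torch.backends.mkldnn.is_available():
    x_mkl = x.to_mkldnn()
    # Without a native Mkl-Layout tanh kernel, two extra copies are
    # needed per call: Mkl layout -> dense, then dense -> Mkl layout.
    y = x_mkl.to_dense().tanh().to_mkldnn()
```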
## Performance results
### Environment
- CPU: Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz
- OS: Ubuntu 18.04.1 LTS
- compiler: gcc 7.5.0
- branch: master
- commit ID: fe2c126
- build Environment variable: USE_CUDA=0
- Python: 3.6.9
- Intel MKL(Math Kernel Library): 2020.2-254
- Intel oneDNN: 1.8.1
### Benchmark script
``` python
import torch
import torch.nn as nn

torch.manual_seed(1)
x = torch.randn(2048, 2048)
x_mkl = x.to_mkldnn()

print("### CPU tanh")
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for i in range(100):
        output = x.tanh()
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

print("\n### CPU tanh_")
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for i in range(100):
        x.tanh_()
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

print("\n### to_dense/to_mkldnn + tanh")
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for i in range(100):
        output = x_mkl.to_dense().tanh().to_mkldnn()
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

print("\n### to_dense/to_mkldnn + tanh_")
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for i in range(100):
        x_mkl.to_dense().tanh_().to_mkldnn()
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

print("\n### Mkl-Layout tanh")
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for i in range(100):
        output = x_mkl.tanh()
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

print("\n### Mkl-Layout tanh_")
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for i in range(100):
        x_mkl.tanh_()
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```
### Results
#### OMP_NUM_THREADS=1 results (Self CPU time total, ms)
| Operation | CPU kernel | to_dense/to_mkldnn + CPU kernel | Mkl-Layout kernel (this PR) |
| --------- | ---------- | ------------------------------- | --------------------------- |
| tanh      | 579.662    | 1658.000                        | 617.565                     |
| tanh_     | 554.477    | 881.997                         | 589.426                     |
#### OMP_NUM_THREADS=6 results (Self CPU time total, ms)
| Operation | CPU kernel | to_dense/to_mkldnn + CPU kernel | Mkl-Layout kernel (this PR) |
| --------- | ---------- | ------------------------------- | --------------------------- |
| tanh      | 182.387    | 421.336                         | 136.226                     |
| tanh_     | 94.331     | 404.931                         | 99.254                      |
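As a cross-check of the "1.6-4.3 times slower" claim in Motivation, the slowdown of the to_dense/to_mkldnn round trip relative to the plain CPU kernel can be recomputed from the two tables above:

``` python
# "Self CPU time total" in ms, keyed by (op, OMP_NUM_THREADS),
# taken from the two results tables above.
cpu = {("tanh", 1): 579.662, ("tanh_", 1): 554.477,
       ("tanh", 6): 182.387, ("tanh_", 6): 94.331}
round_trip = {("tanh", 1): 1658.000, ("tanh_", 1): 881.997,
              ("tanh", 6): 421.336, ("tanh_", 6): 404.931}

# Slowdown of to_dense/to_mkldnn + CPU kernel versus the CPU kernel alone.
slowdown = {k: round_trip[k] / cpu[k] for k in cpu}
# The extremes of these ratios recover the 1.6x-4.3x range quoted in Motivation.
```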
## Modification policy for the code
oneDNN already supports the tanh operation:
[oneDNN: Elementwise](https://spec.oneapi.com/versions/latest/elements/oneDNN/source/primitives/eltwise.html)
A sigmoid implementation that uses the same Elementwise API as tanh already exists, so the code in this PR was written with reference to that sigmoid implementation.
https://github.com/pytorch/pytorch/blob/527c1e0e37b7c65148bcbc390b65e94fb4624a9d/aten/src/ATen/native/mkldnn/UnaryOps.cpp#L28-L42
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54656
Test Plan:
A test for sigmoid already exists, as shown below, so I added a new test for tanh modeled on it.
https://github.com/pytorch/pytorch/blob/527c1e0e37b7c65148bcbc390b65e94fb4624a9d/test/test_mkldnn.py#L944-L954
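For reference, a standalone sketch of the same kind of check (illustrative only, not the exact test code; it assumes an MKL-DNN-enabled build):

``` python
import torch

x = torch.randn(4, 5)
if torch.backends.mkldnn.is_available():
    mkldnn_x = x.to_mkldnn()
    # Out-of-place: the Mkl-Layout kernel must match the dense CPU kernel.
    assert torch.allclose(x.tanh(), mkldnn_x.tanh().to_dense())
    # In-place: both tensors must end up in the same state afterwards.
    x.tanh_()
    mkldnn_x.tanh_()
    assert torch.allclose(x, mkldnn_x.to_dense())
```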
### mkldnn tanh test result
```
$ python3 test/test_mkldnn.py TestMkldnn.test_tanh
Couldn't download test skip set, leaving all tests enabled...
.
----------------------------------------------------------------------
Ran 1 test in 0.004s
OK
```
Reviewed By: gchanan
Differential Revision: D27395827
Pulled By: ezyang
fbshipit-source-id: d4481332de187e2dea095f9b6aabc73a497960fe