Migrate `tan` from TH to ATen (CUDA) (#36906)
Summary:
Closes https://github.com/pytorch/pytorch/issues/24641
Benchmarked with the same build settings on the same system:
gcc  : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU  : GTX 1050 Ti
```python
import timeit

for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.tan(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit('torch.tan(a); torch.cuda.synchronize()',
                            setup=f'import torch; a = torch.arange({n}, dtype={dtype}, device="cuda")',
                            number=t))
```
Before:
```
torch.tan(a) a.numel() == 10000 for 20000 times torch.half
0.28325206200003095
torch.tan(a) a.numel() == 10000 for 20000 times torch.float
0.28363607099998944
torch.tan(a) a.numel() == 10000 for 20000 times torch.double
0.43924326799998425
torch.tan(a) a.numel() == 100000 for 20000 times torch.half
0.3754699589999859
torch.tan(a) a.numel() == 100000 for 20000 times torch.float
0.38143782899999223
torch.tan(a) a.numel() == 100000 for 20000 times torch.double
1.7672172019999834
```
After:
```
torch.tan(a) a.numel() == 10000 for 20000 times torch.half
0.28982524599996395
torch.tan(a) a.numel() == 10000 for 20000 times torch.float
0.29121579000002384
torch.tan(a) a.numel() == 10000 for 20000 times torch.double
0.4599610559998837
torch.tan(a) a.numel() == 100000 for 20000 times torch.half
0.3557764019997194
torch.tan(a) a.numel() == 100000 for 20000 times torch.float
0.34793807599999127
torch.tan(a) a.numel() == 100000 for 20000 times torch.double
1.7564662459999454
```
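For quick reference, the before/after ratios implied by the timings above can be derived with a small standalone script (the numbers are copied verbatim from the tables; a ratio above 1.0 means the ATen port is faster):

```python
# Timings (seconds for 20000 iterations) copied from the Before/After
# tables above, keyed by (dtype, numel).
before = {
    ('half', 10_000): 0.28325206200003095,
    ('float', 10_000): 0.28363607099998944,
    ('double', 10_000): 0.43924326799998425,
    ('half', 100_000): 0.3754699589999859,
    ('float', 100_000): 0.38143782899999223,
    ('double', 100_000): 1.7672172019999834,
}
after = {
    ('half', 10_000): 0.28982524599996395,
    ('float', 10_000): 0.29121579000002384,
    ('double', 10_000): 0.4599610559998837,
    ('half', 100_000): 0.3557764019997194,
    ('float', 100_000): 0.34793807599999127,
    ('double', 100_000): 1.7564662459999454,
}

# Speedup = before / after; > 1.0 means the new kernel is faster.
speedup = {k: before[k] / after[k] for k in before}
for (dtype, numel), s in sorted(speedup.items(), key=lambda kv: kv[0][1]):
    print(f'{dtype:>6} n={numel:>6}: {s:.3f}x')
```

This shows the migration is roughly performance-neutral at 10k elements (slightly slower, within noise) and modestly faster at 100k elements, most visibly for `torch.float`.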
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36906
Differential Revision: D21335320
Pulled By: VitalyFedyunin
fbshipit-source-id: efab9c175c60fb09223105380d48b93a81994fb0