Migrate `sin` and `sin_` from TH to ATen (CUDA) (#28237)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28237
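For reference, the migration follows the standard ATen unary-op pattern: the CUDA kernel is written as an elementwise lambda over a `TensorIterator` and registered against the dispatch stub. Below is a minimal sketch of that pattern, not the literal diff; names such as `sin_kernel_cuda` follow the convention used by similar migrations and the details may differ from what was actually landed:

```cpp
#include <ATen/Dispatch.h>
#include <ATen/native/TensorIterator.h>
#include <ATen/native/UnaryOps.h>
#include <ATen/native/cuda/Loops.cuh>

namespace at { namespace native {

// Dispatch over half/float/double and apply device-side sin()
// elementwise; gpu_kernel handles the launch configuration while
// TensorIterator handles shapes, strides, and output allocation.
void sin_kernel_cuda(TensorIterator& iter) {
  AT_DISPATCH_FLOATING_TYPES_AND_HALF(iter.dtype(), "sin_cuda", [&]() {
    gpu_kernel(iter, [] GPU_LAMBDA (scalar_t a) -> scalar_t {
      return ::sin(a);
    });
  });
}

// Routing sin_stub to this kernel makes both torch.sin and the
// in-place sin_ hit the ATen implementation instead of TH.
REGISTER_DISPATCH(sin_stub, &sin_kernel_cuda);

}} // namespace at::native
```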
Benchmark (RHEL 7, gcc 8.3.1, P1000):
```python
import timeit
# CUDA kernels launch asynchronously, so synchronize after each call
# to make timeit measure actual kernel execution time.
for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.sin(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit(
            'torch.sin(a); torch.cuda.synchronize()',
            setup=f'import torch; a = torch.arange({n}, dtype={dtype}, device="cuda")',
            number=t))
```
Before:
```
torch.sin(a) a.numel() == 10000 for 20000 times torch.half
0.4649172620011086
torch.sin(a) a.numel() == 10000 for 20000 times torch.float
0.4616892600006395
torch.sin(a) a.numel() == 10000 for 20000 times torch.double
0.5166665920005471
torch.sin(a) a.numel() == 100000 for 20000 times torch.half
0.5376560490003612
torch.sin(a) a.numel() == 100000 for 20000 times torch.float
0.6207812359989475
torch.sin(a) a.numel() == 100000 for 20000 times torch.double
1.873208982999131
```
After:
```
torch.sin(a) a.numel() == 10000 for 20000 times torch.half
0.4796977340010926
torch.sin(a) a.numel() == 10000 for 20000 times torch.float
0.48329569199995603
torch.sin(a) a.numel() == 10000 for 20000 times torch.double
0.5380683220009814
torch.sin(a) a.numel() == 100000 for 20000 times torch.half
0.5299932739999349
torch.sin(a) a.numel() == 100000 for 20000 times torch.float
0.6144487999990815
torch.sin(a) a.numel() == 100000 for 20000 times torch.double
1.8838113630008593
```
Closes #24627
Test Plan: Imported from OSS
Differential Revision: D18089072
Pulled By: VitalyFedyunin
fbshipit-source-id: 4824804960309fe7fdb16073d021388704986993