Migrate `asin` and `asin_` from TH to ATen (CUDA) (#28482)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28482
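The migration replaces the legacy THC kernel with a TensorIterator-based implementation. For reference, a minimal sketch of the `gpu_kernel` pattern such ports follow (illustrative, not the exact diff of this PR; file placement and names assume the existing unary-op conventions in `aten/src/ATen/native/cuda/UnaryOpsKernel.cu`):
```cpp
// Illustrative sketch of the ported kernel, following the pattern other
// unary ops use in UnaryOpsKernel.cu (not the exact diff of this PR).
#include <ATen/Dispatch.h>
#include <ATen/native/TensorIterator.h>
#include <ATen/native/UnaryOps.h>
#include <ATen/native/cuda/Loops.cuh>

namespace at { namespace native {

void asin_kernel_cuda(TensorIterator& iter) {
  // Dispatch over half/float/double and apply device-side asin elementwise;
  // TensorIterator lets asin and asin_ share the same kernel.
  AT_DISPATCH_FLOATING_TYPES_AND_HALF(iter.dtype(), "asin_cuda", [&]() {
    gpu_kernel(iter, [] GPU_LAMBDA (scalar_t a) -> scalar_t {
      return ::asin(a);
    });
  });
}

REGISTER_DISPATCH(asin_stub, &asin_kernel_cuda);

}} // namespace at::native
```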
Benchmark (RHEL 7.3, Release, P1000, gcc 8.3):
```python
import timeit
# Time asin plus a device sync so the async CUDA launch is actually measured.
for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.asin(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit('torch.asin(a); torch.cuda.synchronize()',
                            setup=f'import torch; a = torch.arange({n}, dtype={dtype}, device="cuda")',
                            number=t))
```
Before:
```
torch.asin(a) a.numel() == 10000 for 20000 times torch.half
0.475854377997166
torch.asin(a) a.numel() == 10000 for 20000 times torch.float
0.4772826389998954
torch.asin(a) a.numel() == 10000 for 20000 times torch.double
0.6297428649995709
torch.asin(a) a.numel() == 100000 for 20000 times torch.half
0.5475849750000634
torch.asin(a) a.numel() == 100000 for 20000 times torch.float
0.6156488769993302
torch.asin(a) a.numel() == 100000 for 20000 times torch.double
2.728912709000724
```
After:
```
torch.asin(a) a.numel() == 10000 for 20000 times torch.half
0.5107104659982724
torch.asin(a) a.numel() == 10000 for 20000 times torch.float
0.509122366001975
torch.asin(a) a.numel() == 10000 for 20000 times torch.double
0.6929216960015765
torch.asin(a) a.numel() == 100000 for 20000 times torch.half
0.5914848840002378
torch.asin(a) a.numel() == 100000 for 20000 times torch.float
0.6518679289983993
torch.asin(a) a.numel() == 100000 for 20000 times torch.double
2.916458261999651
```
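As a quick sanity check beyond the timings, the CUDA path can be compared against the CPU reference through the C++ frontend (an illustrative spot-check, not part of this PR's test suite):
```cpp
#include <torch/torch.h>

// Spot-check: the migrated CUDA asin/asin_ should match the CPU reference
// on the valid input range [-1, 1].
int main() {
  auto a = torch::linspace(-1.0, 1.0, 1000);
  auto ref = torch::asin(a);  // CPU reference
  TORCH_CHECK(torch::allclose(torch::asin(a.cuda()).cpu(), ref,
                              /*rtol=*/1e-5, /*atol=*/1e-6));
  auto b = a.cuda();
  b.asin_();                  // in-place variant
  TORCH_CHECK(torch::allclose(b.cpu(), ref, /*rtol=*/1e-5, /*atol=*/1e-6));
}
```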
Closes #24537
Test Plan: Imported from OSS
Differential Revision: D18089074
Pulled By: VitalyFedyunin
fbshipit-source-id: f27515dd1ee73b6e2391ebcc0004af28bcb82234