Migrate acos from TH to ATen (CUDA) (#29323)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29323
Benchmark (Debian Buster, gcc 7.4, Release build, P400, turbo off):
```python
import timeit
# Time t calls of torch.acos on an n-element CUDA tensor for each dtype,
# synchronizing after every call so kernel execution time is included.
for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.acos(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit('torch.acos(a); torch.cuda.synchronize()', setup=f'import torch; a=torch.arange({n}, dtype={dtype}, device="cuda")', number=t))
```
Before (timeit total, in seconds):
```
torch.acos(a) a.numel() == 10000 for 20000 times torch.half
0.3783099120009865
torch.acos(a) a.numel() == 10000 for 20000 times torch.float
0.37258279799971206
torch.acos(a) a.numel() == 10000 for 20000 times torch.double
0.5627449999992677
torch.acos(a) a.numel() == 100000 for 20000 times torch.half
0.8581132070012245
torch.acos(a) a.numel() == 100000 for 20000 times torch.float
1.0164795860000595
torch.acos(a) a.numel() == 100000 for 20000 times torch.double
2.644646360999104
```
After (timeit total, in seconds):
```
torch.acos(a) a.numel() == 10000 for 20000 times torch.half
0.3873771430007764
torch.acos(a) a.numel() == 10000 for 20000 times torch.float
0.38498222500038537
torch.acos(a) a.numel() == 10000 for 20000 times torch.double
0.5826049269999203
torch.acos(a) a.numel() == 100000 for 20000 times torch.half
0.8118497010000283
torch.acos(a) a.numel() == 100000 for 20000 times torch.float
1.0175845949997893
torch.acos(a) a.numel() == 100000 for 20000 times torch.double
2.658536324999659
```
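To read the numbers above at a glance, the following sketch computes the after/before ratio for each configuration. The timings are copied from the output above; the `before`, `after`, and `ratios` names are illustrative only and not part of the change.

```python
# Timings copied from the Before/After output above
# (total seconds for 20000 calls). Keys are (numel, dtype).
before = {
    (10_000, 'half'): 0.3783099120009865,
    (10_000, 'float'): 0.37258279799971206,
    (10_000, 'double'): 0.5627449999992677,
    (100_000, 'half'): 0.8581132070012245,
    (100_000, 'float'): 1.0164795860000595,
    (100_000, 'double'): 2.644646360999104,
}
after = {
    (10_000, 'half'): 0.3873771430007764,
    (10_000, 'float'): 0.38498222500038537,
    (10_000, 'double'): 0.5826049269999203,
    (100_000, 'half'): 0.8118497010000283,
    (100_000, 'float'): 1.0175845949997893,
    (100_000, 'double'): 2.658536324999659,
}

# A ratio below 1.0 means the ATen version was faster in that run.
ratios = {key: after[key] / before[key] for key in before}
for (numel, dtype), ratio in ratios.items():
    print(f'{numel:>7} {dtype:<6} after/before = {ratio:.3f}')
```

All six ratios land within roughly 6% of 1.0, which suggests the ATen port performs on par with the TH implementation for these sizes.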
Close #24532
Test Plan: Imported from OSS
Differential Revision: D18406806
Pulled By: VitalyFedyunin
fbshipit-source-id: 2d012485f4747fae0ddbcf2e08b1d75ef5274a19