Move CUDA abs to ATen (#25857)
Summary:
VitalyFedyunin, this PR fixes https://github.com/pytorch/pytorch/issues/24531
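As a quick sanity check that the ported CUDA kernel returns the same results as before, a sketch like the following can be used (the values are illustrative; `uint8` is skipped since it is unsigned and `abs` is the identity there):
```
import torch

# Spot-check CUDA abs against hand-computed expected values for each signed dtype.
for dtype in (torch.int8, torch.int16, torch.int32, torch.int64,
              torch.float, torch.double, torch.half):
    a = torch.tensor([-3, -1, 0, 2], device="cuda", dtype=dtype)
    expected = torch.tensor([3, 1, 0, 2], device="cuda", dtype=dtype)
    assert torch.equal(a.abs(), expected), dtype
```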
Benchmark script:
```
import timeit

device = "cuda"
for n, t in [(10, 100000), (1000, 10000)]:
    print('a.abs() (a.numel() == {}) for {} times'.format(n, t))
    for dtype in ('torch.int8', 'torch.uint8', 'torch.int16', 'torch.int32', 'torch.int64', 'torch.float', 'torch.double', 'torch.half'):
        print(f'device: {device}, dtype: {dtype}, {t} times', end='\t\t')
        print(timeit.timeit(
            f'a.abs()\nif "{device}" == "cuda": torch.cuda.synchronize()',
            setup=f'import torch; a = torch.ones({n}, device="{device}", dtype={dtype})',
            number=t))
```
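The `torch.cuda.synchronize()` in the timed statement makes the wall-clock numbers include kernel completion rather than just launch. As a cross-check that is less sensitive to Python overhead, the same loop can be timed with CUDA events; this is a rough sketch on the default stream, not part of the PR:
```
import torch

a = torch.ones(1000, device="cuda", dtype=torch.float)
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
a.abs()  # warm up so one-time setup cost is excluded
torch.cuda.synchronize()
start.record()
for _ in range(10000):
    a.abs()
end.record()
torch.cuda.synchronize()
print(f"{start.elapsed_time(end):.3f} ms for 10000 calls")
```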
Device: **Tesla P4**
CUDA version: **9.0.176**
Before this change:
```
a.abs() (a.numel() == 10) for 100000 times
device: cuda, dtype: torch.int8, 100000 times 1.8391285985708237
device: cuda, dtype: torch.uint8, 100000 times 1.8831938095390797
device: cuda, dtype: torch.int16, 100000 times 1.8131775446236134
device: cuda, dtype: torch.int32, 100000 times 1.832334715873003
device: cuda, dtype: torch.int64, 100000 times 1.8218239657580853
device: cuda, dtype: torch.float, 100000 times 1.7942761108279228
device: cuda, dtype: torch.double, 100000 times 1.8193779103457928
device: cuda, dtype: torch.half, 100000 times 1.796515878289938
a.abs() (a.numel() == 1000) for 10000 times
device: cuda, dtype: torch.int8, 10000 times 0.18348361551761627
device: cuda, dtype: torch.uint8, 10000 times 0.1892806850373745
device: cuda, dtype: torch.int16, 10000 times 0.18253886327147484
device: cuda, dtype: torch.int32, 10000 times 0.18509215489029884
device: cuda, dtype: torch.int64, 10000 times 0.18291602283716202
device: cuda, dtype: torch.float, 10000 times 0.1796952784061432
device: cuda, dtype: torch.double, 10000 times 0.18088893592357635
device: cuda, dtype: torch.half, 10000 times 0.18222836777567863
```
After this change:
```
a.abs() (a.numel() == 10) for 100000 times
device: cuda, dtype: torch.int8, 100000 times 1.7365420907735825
device: cuda, dtype: torch.uint8, 100000 times 1.7433889284729958
device: cuda, dtype: torch.int16, 100000 times 1.7034666128456593
device: cuda, dtype: torch.int32, 100000 times 1.6825932636857033
device: cuda, dtype: torch.int64, 100000 times 1.6896217577159405
device: cuda, dtype: torch.float, 100000 times 1.7211194895207882
device: cuda, dtype: torch.double, 100000 times 1.6823345720767975
device: cuda, dtype: torch.half, 100000 times 1.7027524448931217
a.abs() (a.numel() == 1000) for 10000 times
device: cuda, dtype: torch.int8, 10000 times 0.17180879414081573
device: cuda, dtype: torch.uint8, 10000 times 0.17316896095871925
device: cuda, dtype: torch.int16, 10000 times 0.16990498825907707
device: cuda, dtype: torch.int32, 10000 times 0.1681906059384346
device: cuda, dtype: torch.int64, 10000 times 0.16994905844330788
device: cuda, dtype: torch.float, 10000 times 0.1719626784324646
device: cuda, dtype: torch.double, 10000 times 0.16886932775378227
device: cuda, dtype: torch.half, 10000 times 0.16957201063632965
```
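From the two runs, the ATen port is consistently a bit faster; for example, computed from the int32 timings above:
```
# Relative improvement, using the int32 numbers copied from the tables above.
before = {"numel=10": 1.832334715873003, "numel=1000": 0.18509215489029884}
after = {"numel=10": 1.6825932636857033, "numel=1000": 0.1681906059384346}
for k in before:
    print(f"{k}: {(before[k] - after[k]) / before[k]:.1%} faster")
# numel=10: 8.2% faster
# numel=1000: 9.1% faster
```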
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25857
Differential Revision: D18299368
Pulled By: VitalyFedyunin
fbshipit-source-id: 173eb0f6ca5a12a27f3d53466ff373a5f81f1da8