Port cuda sigmoid to Aten(CUDA) (#26643)
Summary:
VitalyFedyunin, this PR ports the CUDA sigmoid to ATen: https://github.com/pytorch/pytorch/issues/24624. The TH/THC sigmoid code can't be removed yet because sigmoid_backward in THNN/THCUNN still relies on it. As a next step I will port sigmoid_backward to ATen (both CPU and CUDA), which will allow removing the sigmoid code from TH/THC.
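For readers unfamiliar with these ports: moving a unary op from THC to ATen typically means implementing a TensorIterator-based CUDA kernel in ATen's native code. The sketch below only illustrates the general shape of such a kernel; the function name `sigmoid_kernel_cuda`, the dispatch macro choice, and the exact headers are assumptions, not the verbatim contents of this PR.
```
// Illustrative sketch of a TensorIterator-based CUDA sigmoid kernel.
// Identifiers here are assumed for illustration, not copied from the PR.
#include <ATen/Dispatch.h>
#include <ATen/native/TensorIterator.h>
#include <ATen/native/cuda/Loops.cuh>

namespace at { namespace native {

void sigmoid_kernel_cuda(TensorIterator& iter) {
  // Dispatch over float, double, and half, then launch an elementwise GPU loop.
  AT_DISPATCH_FLOATING_TYPES_AND_HALF(iter.dtype(), "sigmoid_cuda", [&]() {
    gpu_kernel(iter, [] GPU_LAMBDA (scalar_t a) -> scalar_t {
      // Elementwise sigmoid: 1 / (1 + exp(-a)).
      scalar_t one = scalar_t(1);
      return one / (one + std::exp(-a));
    });
  });
}

}} // namespace at::native
```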
Test script:
```
import timeit

device = "cuda"
for n, t in [(10, 100000), (1000, 10000)]:
    print('a.sigmoid() (a.numel() == {}) for {} times'.format(n, t))
    for dtype in ('torch.float', 'torch.double', 'torch.half'):
        print(f'device: {device}, dtype: {dtype}, {t} times', end='\t\t')
        print(timeit.timeit(f'a.sigmoid()\nif "{device}" == "cuda": torch.cuda.synchronize()',
                            setup=f'import torch; a = torch.ones({n}, device="{device}", dtype={dtype})',
                            number=t))
```
Device: **Tesla P40**
Before:
```
a.sigmoid() (a.numel() == 10) for 100000 times
device: cuda, dtype: torch.float, 100000 times 1.2853778750286438
device: cuda, dtype: torch.double, 100000 times 1.2787265420192853
device: cuda, dtype: torch.half, 100000 times 1.2610833930084482
a.sigmoid() (a.numel() == 1000) for 10000 times
device: cuda, dtype: torch.float, 10000 times 0.1274153349804692
device: cuda, dtype: torch.double, 10000 times 0.13953313598176464
device: cuda, dtype: torch.half, 10000 times 0.1265286349807866
```
After:
```
a.sigmoid() (a.numel() == 10) for 100000 times
device: cuda, dtype: torch.float, 100000 times 1.275270765996538
device: cuda, dtype: torch.double, 100000 times 1.285128042974975
device: cuda, dtype: torch.half, 100000 times 1.2761492819990963
a.sigmoid() (a.numel() == 1000) for 10000 times
device: cuda, dtype: torch.float, 10000 times 0.12851508799940348
device: cuda, dtype: torch.double, 10000 times 0.13738596899202093
device: cuda, dtype: torch.half, 10000 times 0.12715664599090815
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26643
Differential Revision: D17666550
Pulled By: VitalyFedyunin
fbshipit-source-id: 376479d94d0649c171fd0b2557699bbdd050fec3