Vectorize softplus and its backward function on CPU (#32944)
Summary:
Benchmarking shows a large performance gain: roughly 2-8x faster, depending on dtype and on forward vs. backward (per-configuration speedups are computed after the results below). I also removed Half support, since Half isn't generally supported on CPU.
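For reference, the forward kernel computes the numerically stable formulation of softplus, falling back to the identity where `beta * x` exceeds `threshold`, and the backward kernel recovers `sigmoid(beta * x)` from the saved output. Below is a minimal Python sketch of that math; the names `softplus_ref`/`softplus_backward_ref` are illustrative only, and the actual change lives in the vectorized C++ CPU kernels.
```python
import torch

def softplus_ref(x, beta=1.0, threshold=20.0):
    # softplus(x) = (1 / beta) * log(1 + exp(beta * x)), falling back to
    # the identity where beta * x > threshold to avoid overflow in exp().
    scaled = beta * x
    return torch.where(scaled > threshold, x,
                       torch.log1p(torch.exp(scaled)) / beta)

def softplus_backward_ref(grad_output, output, beta=1.0, threshold=20.0):
    # d/dx softplus(x) = sigmoid(beta * x), expressed in terms of the
    # saved output as (z - 1) / z with z = exp(beta * output).
    z = torch.exp(beta * output)
    return torch.where(beta * output > threshold, grad_output,
                       grad_output * (z - 1) / z)
```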
Benchmark: (Debian 10, Release build, gcc 8.3, no turbo, Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz)
```python
import timeit

for op in ('Softplus',):
    print('Forward')
    for dtype in ('torch.double', 'torch.float'):
        for n, t in [(10_000, 10000),
                     (100_000, 1000)]:
            print(f'torch.nn.{op}()(a), numel() == {n} for {t} times, dtype={dtype}')
            print(timeit.timeit(
                'm(a)',
                setup=f'import torch; m = torch.nn.{op}(); a = torch.randn({n}, dtype={dtype})',
                number=t))
    print('Backward')
    for dtype in ('torch.double', 'torch.float'):
        for n, t in [(10_000, 40000),
                     (100_000, 4000)]:
            print(f'torch.nn.{op}()(a), numel() == {n} for {t} times, dtype={dtype}')
            print(timeit.timeit(
                'y.backward(retain_graph=True)',
                setup=f'import torch; m = torch.nn.{op}(); a = torch.randn({n}, dtype={dtype}, requires_grad=True); x = m(a); y = x.sum()',
                number=t))
```
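As a sanity check (not part of the benchmark itself), the vectorized kernels can be compared against the reference math; a minimal sketch with the default `beta=1`, `threshold=20`:
```python
import torch

a = torch.randn(100_000, requires_grad=True)
out = torch.nn.Softplus()(a)  # beta=1, threshold=20 by default

# Forward should match the stable reference formula
# (identity past the threshold).
ref = torch.where(a > 20.0, a, torch.log1p(torch.exp(a)))
assert torch.allclose(out, ref, atol=1e-6)

# The derivative of softplus with beta=1 is sigmoid, so the gradient
# of out.sum() w.r.t. a should match sigmoid(a).
out.sum().backward()
assert torch.allclose(a.grad, torch.sigmoid(a), atol=1e-6)
```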
Before:
```
Forward
torch.nn.Softplus()(a), numel() == 10000 for 10000 times, dtype=torch.double
3.73130346799735
torch.nn.Softplus()(a), numel() == 100000 for 1000 times, dtype=torch.double
3.6790116359916283
torch.nn.Softplus()(a), numel() == 10000 for 10000 times, dtype=torch.float
2.7477027159911813
torch.nn.Softplus()(a), numel() == 100000 for 1000 times, dtype=torch.float
2.7382752639969112
Backward
torch.nn.Softplus()(a), numel() == 10000 for 40000 times, dtype=torch.double
7.037510035006562
torch.nn.Softplus()(a), numel() == 100000 for 4000 times, dtype=torch.double
5.855093962003593
torch.nn.Softplus()(a), numel() == 10000 for 40000 times, dtype=torch.float
3.413616877005552
torch.nn.Softplus()(a), numel() == 100000 for 4000 times, dtype=torch.float
2.5485514330066508
```
After:
```
Forward
torch.nn.Softplus()(a), numel() == 10000 for 10000 times, dtype=torch.double
0.9465823079954134
torch.nn.Softplus()(a), numel() == 100000 for 1000 times, dtype=torch.double
0.8799468770012027
torch.nn.Softplus()(a), numel() == 10000 for 10000 times, dtype=torch.float
0.39715987400268205
torch.nn.Softplus()(a), numel() == 100000 for 1000 times, dtype=torch.float
0.3563060039887205
Backward
torch.nn.Softplus()(a), numel() == 10000 for 40000 times, dtype=torch.double
2.400547721001203
torch.nn.Softplus()(a), numel() == 100000 for 4000 times, dtype=torch.double
1.4740848699875642
torch.nn.Softplus()(a), numel() == 10000 for 40000 times, dtype=torch.float
1.6684603010071442
torch.nn.Softplus()(a), numel() == 100000 for 4000 times, dtype=torch.float
0.6815649690106511
```
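For reference, the per-configuration speedups implied by the numbers above (before / after), in the same order as the listings:
```python
# Times copied from the Before/After listings above, in listing order:
# double/float forward, then double/float backward.
before = [3.731, 3.679, 2.748, 2.738, 7.038, 5.855, 3.414, 2.549]
after  = [0.947, 0.880, 0.397, 0.356, 2.401, 1.474, 1.668, 0.682]
for b, a in zip(before, after):
    print(f'{b / a:.1f}x')  # ranges from ~2.0x to ~7.7x
```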
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32944
Differential Revision: D19725407
Pulled By: VitalyFedyunin
fbshipit-source-id: 7430de838df731bd17617eff63f10107d5ad6b8b