Vectorize elu and its backward function on CPU (#32986)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32986
Benchmark: (Debian 10, Release build, gcc 8.3, no turbo, Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz)
```python
import timeit

for op in ('ELU',):
    print('Forward')
    for dtype in ('torch.double', 'torch.float'):
        for n, t in [(10_000, 100000),
                     (100_000, 10000)]:
            print(f'torch.nn.{op}()(a), numel() == {n} for {t} times, dtype={dtype}')
            print(timeit.timeit('m(a)', setup=f'import torch; m = torch.nn.{op}(); a = torch.linspace(-1, 1, {n}, dtype={dtype})', number=t))
    print('Backward')
    for dtype in ('torch.double', 'torch.float'):
        for n, t in [(20_000, 100000),
                     (200_000, 10000)]:
            print(f'torch.nn.{op}()(a), numel() == {n} for {t} times, dtype={dtype}')
            print(timeit.timeit('y.backward(retain_graph=True)',
                                setup=f'import torch; m = torch.nn.{op}(); a = torch.linspace(-1, 1, {n}, requires_grad=True, dtype={dtype}); x = m(a); y = x.sum()',
                                number=t))
```
Before:
```
Forward
torch.nn.ELU()(a), numel() == 10000 for 100000 times, dtype=torch.double
5.292799739996553
torch.nn.ELU()(a), numel() == 100000 for 10000 times, dtype=torch.double
4.828570917001343
torch.nn.ELU()(a), numel() == 10000 for 100000 times, dtype=torch.float
3.1359513780043926
torch.nn.ELU()(a), numel() == 100000 for 10000 times, dtype=torch.float
2.7030876770004397
Backward
torch.nn.ELU()(a), numel() == 20000 for 100000 times, dtype=torch.double
4.568238995998399
torch.nn.ELU()(a), numel() == 200000 for 10000 times, dtype=torch.double
1.8908141480060294
torch.nn.ELU()(a), numel() == 20000 for 100000 times, dtype=torch.float
3.8652471189998323
torch.nn.ELU()(a), numel() == 200000 for 10000 times, dtype=torch.float
1.13068484600808
```
After:
```
Forward
torch.nn.ELU()(a), numel() == 10000 for 100000 times, dtype=torch.double
2.1265591429983033
torch.nn.ELU()(a), numel() == 100000 for 10000 times, dtype=torch.double
1.6708065870043356
torch.nn.ELU()(a), numel() == 10000 for 100000 times, dtype=torch.float
1.1806934149935842
torch.nn.ELU()(a), numel() == 100000 for 10000 times, dtype=torch.float
0.77735430400935
Backward
torch.nn.ELU()(a), numel() == 20000 for 100000 times, dtype=torch.double
4.494567882007686
torch.nn.ELU()(a), numel() == 200000 for 10000 times, dtype=torch.double
2.007220732004498
torch.nn.ELU()(a), numel() == 20000 for 100000 times, dtype=torch.float
3.615133151994087
torch.nn.ELU()(a), numel() == 200000 for 10000 times, dtype=torch.float
1.105554559995653
```
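For easier reading, the before/after ratios can be derived from the timings listed above. The following is a minimal sketch, not part of the benchmark itself: the dictionary layout and the helper name `speedups` are illustrative assumptions, and the values are copied from the two listings (rounded to six decimal places).

```python
# Timings in seconds, copied from the "Before" and "After" listings above
# (rounded to six decimal places). Keys are (pass, dtype, numel).
before = {
    ("forward", "double", 10_000): 5.292800,
    ("forward", "double", 100_000): 4.828571,
    ("forward", "float", 10_000): 3.135951,
    ("forward", "float", 100_000): 2.703088,
    ("backward", "double", 20_000): 4.568239,
    ("backward", "double", 200_000): 1.890814,
    ("backward", "float", 20_000): 3.865247,
    ("backward", "float", 200_000): 1.130685,
}
after = {
    ("forward", "double", 10_000): 2.126559,
    ("forward", "double", 100_000): 1.670807,
    ("forward", "float", 10_000): 1.180693,
    ("forward", "float", 100_000): 0.777354,
    ("backward", "double", 20_000): 4.494568,
    ("backward", "double", 200_000): 2.007221,
    ("backward", "float", 20_000): 3.615133,
    ("backward", "float", 200_000): 1.105555,
}


def speedups(before, after):
    """Hypothetical helper: before/after ratio per configuration (>1 means faster after)."""
    return {key: before[key] / after[key] for key in before}


ratios = speedups(before, after)
for (direction, dtype, numel), ratio in ratios.items():
    print(f"{direction:8s} {dtype:6s} numel={numel:>7d}  {ratio:.2f}x")
```

On this one machine the forward ratios come out at roughly 2.5x to 3.5x, while the backward ratios stay close to 1x (about 0.94x to 1.07x), so the measured gain in these numbers is concentrated in the forward pass.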
Test Plan: Imported from OSS
Differential Revision: D19794595
Pulled By: VitalyFedyunin
fbshipit-source-id: c319ec04676ced22179b8b34789ac8bf6428deab