Vectorize SmoothL1Loss forward (CPU) (#37115)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37115
Benchmark (Debian 10, Release build, gcc 8.3, no turbo, Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz):
```python
import timeit
for op in ('SmoothL1Loss',):
    print('Forward')
    for dtype in ('torch.double', 'torch.float', 'torch.bfloat16'):
        for n, t in [(10_000, 100000),
                     (100_000, 10000)]:
            print(f'torch.nn.{op}()(a, b), |a-b|>1, numel() == {n} for {t} times, dtype={dtype}')
            print(timeit.timeit('m(a, b)', setup=f'import torch; m = torch.nn.{op}(); a = torch.full(({n},), 1, dtype={dtype}); b = torch.full(({n},), 3, dtype={dtype})', number=t))
            print(f'torch.nn.{op}()(a, b), |a-b|<1, numel() == {n} for {t} times, dtype={dtype}')
            print(timeit.timeit('m(a, b)', setup=f'import torch; m = torch.nn.{op}(); a = torch.full(({n},), 1, dtype={dtype}); b = torch.full(({n},), 1.5, dtype={dtype})', number=t))
```
Results:
Before:
```
Forward
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.double
2.8427017140056705
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.double
2.823863306999556
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.double
0.9239509999897564
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.double
0.9014650480094133
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.float
2.4530331650021253
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.float
2.4551637870026752
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.float
0.5716871829936281
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.float
0.5748704470024677
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
9.777982015002635
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
12.627838339001755
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
7.810075458997744
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
10.73597132100258
```
After:
```
Forward
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.double
2.8420191049808636
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.double
2.8814279660000466
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.double
0.9491433810035232
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.double
0.9144560259883292
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.float
2.4458729829930235
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.float
2.4474395569995977
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.float
0.5676976410031784
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.float
0.5793530470109545
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
4.32380092900712
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 10000 for 100000 times, dtype=torch.bfloat16
4.332892568985699
torch.nn.SmoothL1Loss()(a, b), |a-b|>1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
2.3354615129937883
torch.nn.SmoothL1Loss()(a, b), |a-b|<1, numel() == 100000 for 10000 times, dtype=torch.bfloat16
2.3352111729909666
```
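The numbers above show the gain is concentrated in bfloat16 (roughly 2-4.6x faster), while double and float timings are essentially unchanged. For reference, a minimal pure-Python sketch of what the kernel computes elementwise (Smooth L1 with the threshold fixed at 1 and mean reduction, matching `torch.nn.SmoothL1Loss` defaults at the time; the `smooth_l1*` names here are illustrative, not PyTorch APIs):

```python
def smooth_l1(x):
    # Elementwise Smooth L1: quadratic near zero, linear beyond |x| = 1.
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5

def smooth_l1_loss(a, b):
    # Mean reduction over elementwise differences a - b,
    # mirroring torch.nn.SmoothL1Loss()(a, b) on 1-D inputs.
    return sum(smooth_l1(ai - bi) for ai, bi in zip(a, b)) / len(a)
```

With the benchmark's inputs, `|a-b| > 1` (a=1, b=3) exercises the linear branch and `|a-b| < 1` (a=1, b=1.5) the quadratic branch, which is why both cases are timed separately.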
Test Plan: Imported from OSS
Differential Revision: D21351860
Pulled By: VitalyFedyunin
fbshipit-source-id: b19ca1e58586d964972e5c495aba10c8808cd747