Vectorize smooth L1 loss backward function on CPU. (#30046)
Summary:
Benchmark (Intel i7-8850H, turbo off, release build, RHEL 7.7):
```
import timeit
for dtype in ('torch.float', 'torch.double'):
    print(f'dtype={dtype}')
    for n, t in [(10_000, 100000),
                 (100_000, 20000)]:
        print(f'numel() == {n} for {t} times')
        print(timeit.timeit('output.backward(retain_graph=True)', number=t, setup=f"""
import torch
loss = torch.nn.SmoothL1Loss()
input = torch.randn({n}, dtype={dtype}, requires_grad=True)
target = torch.randn({n}, dtype={dtype})
output = loss(input, target)
"""))
```
Before:
```
dtype=torch.float
numel() == 10000 for 100000 times
6.154701935998673
numel() == 100000 for 20000 times
5.157296671999575
dtype=torch.double
numel() == 10000 for 100000 times
6.195317157000318
numel() == 100000 for 20000 times
5.099748799999361
```
After:
```
dtype=torch.float
numel() == 10000 for 100000 times
4.968745516000126
numel() == 100000 for 20000 times
2.4029395039997326
dtype=torch.double
numel() == 10000 for 100000 times
4.9910852479988534
numel() == 100000 for 20000 times
2.4867371629989066
```
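For context, the backward pass being vectorized computes a branch-free elementwise select, which is what makes it a good SIMD candidate: with x = input - target and mean reduction, the gradient with respect to the input is x/numel where |x| < 1 and sign(x)/numel elsewhere. The Python sketch below is illustrative only (the helper name is made up; it is not the C++ kernel) and checks that formula against autograd:
```
import torch

# Hypothetical reference implementation of the smooth L1 gradient
# (threshold 1, reduction='mean'), matching SmoothL1Loss defaults.
def smooth_l1_grad(input, target):
    x = input - target
    g = torch.where(x.abs() < 1, x, x.sign())  # quadratic region vs. linear region
    return g / x.numel()                       # scale from the mean reduction

input = torch.randn(1000, requires_grad=True)
target = torch.randn(1000)
torch.nn.SmoothL1Loss()(input, target).backward()
assert torch.allclose(input.grad, smooth_l1_grad(input.detach(), target))
```
In throughput terms, the large-tensor float case goes from about 2.6 ns/element (5.16 s over 20000 × 100000 elements) to about 1.2 ns/element, roughly a 2.1x speedup.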
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30046
Differential Revision: D18602399
Pulled By: VitalyFedyunin
fbshipit-source-id: 4c6c7b7b69ad6bce759786ddd7d6bc1e88ecf6ab