Speed up threshold on CPU. (#27155)
Summary:
This is a small fix, but the runtime improvement is consistent across dtypes and sizes (roughly 1% to 6% in the runs below, so well under 10%):
Benchmark (no turbo, Release build, gcc 8.3, RHEL 7.7, Intel(R) Core(TM) i7-8850H):
```python
import timeit

for dtype in ('torch.double', 'torch.float', 'torch.int16', 'torch.int32', 'torch.int64'):
    print(f'dtype={dtype}')
    # (numel, repetitions): fewer repetitions for the larger tensor
    for n, t in [(70_000, 200000),
                 (700_000, 20000)]:
        print(f'torch.nn.Threshold(0.1, 20)(a), numel() == {n} for {t} times')
        print(timeit.timeit('m(a)', setup=f'import torch; m=torch.nn.Threshold(0.1, 20); a = torch.arange({n}, dtype={dtype})', number=t))
```
Before:
```
dtype=torch.double
torch.nn.Threshold(0.1, 20)(a), numel() == 70000 for 200000 times
8.88117562699972
torch.nn.Threshold(0.1, 20)(a), numel() == 700000 for 20000 times
9.525143070000013
dtype=torch.float
torch.nn.Threshold(0.1, 20)(a), numel() == 70000 for 200000 times
5.673380930000349
torch.nn.Threshold(0.1, 20)(a), numel() == 700000 for 20000 times
3.677610996000112
dtype=torch.int16
torch.nn.Threshold(0.1, 20)(a), numel() == 70000 for 200000 times
3.957677209999929
torch.nn.Threshold(0.1, 20)(a), numel() == 700000 for 20000 times
1.8512293700005102
dtype=torch.int32
torch.nn.Threshold(0.1, 20)(a), numel() == 70000 for 200000 times
5.624350482999944
torch.nn.Threshold(0.1, 20)(a), numel() == 700000 for 20000 times
3.670380037000541
dtype=torch.int64
torch.nn.Threshold(0.1, 20)(a), numel() == 70000 for 200000 times
8.86375758200029
torch.nn.Threshold(0.1, 20)(a), numel() == 700000 for 20000 times
9.468234717999621
```
After:
```
dtype=torch.double
torch.nn.Threshold(0.1, 20)(a), numel() == 70000 for 200000 times
8.64173036200009
torch.nn.Threshold(0.1, 20)(a), numel() == 700000 for 20000 times
9.456986365000375
dtype=torch.float
torch.nn.Threshold(0.1, 20)(a), numel() == 70000 for 200000 times
5.431988049000211
torch.nn.Threshold(0.1, 20)(a), numel() == 700000 for 20000 times
3.446968590000324
dtype=torch.int16
torch.nn.Threshold(0.1, 20)(a), numel() == 70000 for 200000 times
3.743787463999979
torch.nn.Threshold(0.1, 20)(a), numel() == 700000 for 20000 times
1.823233144000369
dtype=torch.int32
torch.nn.Threshold(0.1, 20)(a), numel() == 70000 for 200000 times
5.42801834400052
torch.nn.Threshold(0.1, 20)(a), numel() == 700000 for 20000 times
3.4600211680008215
dtype=torch.int64
torch.nn.Threshold(0.1, 20)(a), numel() == 70000 for 200000 times
8.562551314000302
torch.nn.Threshold(0.1, 20)(a), numel() == 700000 for 20000 times
9.37924196699987
```
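For reference, the relative runtime reduction per configuration can be derived from the listings above. This is only a sketch: the timings are copied from the Before/After output, while the dictionary layout and the `improvement_pct` helper are illustrative names that are not part of this change.

```python
# Timings in seconds, copied from the "Before" and "After" listings above,
# keyed by (dtype, numel). The layout is illustrative only.
before = {
    ('double', 70_000): 8.88117562699972,
    ('double', 700_000): 9.525143070000013,
    ('float', 70_000): 5.673380930000349,
    ('float', 700_000): 3.677610996000112,
    ('int16', 70_000): 3.957677209999929,
    ('int16', 700_000): 1.8512293700005102,
    ('int32', 70_000): 5.624350482999944,
    ('int32', 700_000): 3.670380037000541,
    ('int64', 70_000): 8.86375758200029,
    ('int64', 700_000): 9.468234717999621,
}
after = {
    ('double', 70_000): 8.64173036200009,
    ('double', 700_000): 9.456986365000375,
    ('float', 70_000): 5.431988049000211,
    ('float', 700_000): 3.446968590000324,
    ('int16', 70_000): 3.743787463999979,
    ('int16', 700_000): 1.823233144000369,
    ('int32', 70_000): 5.42801834400052,
    ('int32', 700_000): 3.4600211680008215,
    ('int64', 70_000): 8.562551314000302,
    ('int64', 700_000): 9.37924196699987,
}


def improvement_pct(before_s, after_s):
    """Return the relative runtime reduction in percent (hypothetical helper)."""
    return (before_s - after_s) / before_s * 100.0


improvements = {key: improvement_pct(before[key], after[key]) for key in before}
for (dtype, n), pct in improvements.items():
    print(f'{dtype:>6} numel={n:>7}: {pct:.1f}% less time')
```

These are single runs on one machine, so the per-configuration percentages should be read as indicative rather than exact.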
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27155
Differential Revision: D17790768
Pulled By: VitalyFedyunin
fbshipit-source-id: 3281eaff77ddddd658048c9e73824dd68c548591