Vectorize bitwise_not (#45103)
Summary:
Benchmark (Debian 10, Release build, gcc 8.3, no turbo, Intel(R) Xeon(R)
E-2136 CPU @ 3.30GHz):
```python
import timeit
for dtype in ('torch.int64', 'torch.int32', 'torch.int16', 'torch.int8', 'torch.uint8'):
for n, t in [(10_000, 100000),
(100_000, 10000)]:
print(f'torch.bitwise_not(a), numel() == {n} for {t} times, dtype={dtype}')
print(timeit.timeit('torch.bitwise_not(a)', setup=f'import torch; a = torch.arange(-{n//2}, {n//2}, dtype={dtype})', number=t))
```
Before:
```
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.int64
0.5479081739904359
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.int64
0.3350257440470159
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.int32
0.39590477803722024
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.int32
0.25563537096604705
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.int16
0.31152817397378385
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.int16
0.20817365101538599
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.int8
0.8573925020173192
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.int8
0.4150037349900231
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.uint8
0.8551108679967001
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.uint8
0.37137620500288904
```
After:
```
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.int64
0.5232444299617782
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.int64
0.33852163201663643
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.int32
0.3931163849774748
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.int32
0.24392802000511438
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.int16
0.3122224889229983
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.int16
0.1977886479580775
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.int8
0.26711542706470937
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.int8
0.18208567495457828
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.uint8
0.2615354140289128
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.uint8
0.17972210398875177
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45103
Reviewed By: ailzhang
Differential Revision: D23848675
Pulled By: ezyang
fbshipit-source-id: 6dde1ab32d9a343a49de66ad9f9b062fa23824d2