Vectorize (CPU) generic types for binary bitwise operators (#34338)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34338
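This commit vectorizes the binary bitwise operators (bitwise_and,
bitwise_or, bitwise_xor) on CPU for generic types, i.e. those without
an AVX2-optimized implementation, which speeds up these operations on
such types.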
Benchmark (RHEL 7.7, Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz, Turbo off, Release build):
```python
import timeit

for op in ('bitwise_and', 'bitwise_or', 'bitwise_xor'):
    for dtype in ('torch.int8', 'torch.uint8'):
        for n, t in [(10_000, 200000),
                     (100_000, 20000)]:
            print(f'a.{op}_(b), numel() == {n} for {t} times, dtype={dtype}')
            print(timeit.timeit(f'a.{op}_(b)', setup=f'import torch; a = torch.arange(1, {n}, dtype={dtype}); b = torch.arange({n}, 1, -1, dtype={dtype})', number=t))
```
Before (timeit totals, in seconds):
```
a.bitwise_and_(b), numel() == 10000 for 200000 times, dtype=torch.int8
1.353799690001324
a.bitwise_and_(b), numel() == 100000 for 20000 times, dtype=torch.int8
1.056434961999912
a.bitwise_and_(b), numel() == 10000 for 200000 times, dtype=torch.uint8
1.2957618809996347
a.bitwise_and_(b), numel() == 100000 for 20000 times, dtype=torch.uint8
1.0591609650000464
a.bitwise_or_(b), numel() == 10000 for 200000 times, dtype=torch.int8
1.3113185389993305
a.bitwise_or_(b), numel() == 100000 for 20000 times, dtype=torch.int8
1.0693870880022587
a.bitwise_or_(b), numel() == 10000 for 200000 times, dtype=torch.uint8
1.3075691039994126
a.bitwise_or_(b), numel() == 100000 for 20000 times, dtype=torch.uint8
1.0589785859992844
a.bitwise_xor_(b), numel() == 10000 for 200000 times, dtype=torch.int8
1.3036618039986934
a.bitwise_xor_(b), numel() == 100000 for 20000 times, dtype=torch.int8
1.0595013140009542
a.bitwise_xor_(b), numel() == 10000 for 200000 times, dtype=torch.uint8
1.2947387999993225
a.bitwise_xor_(b), numel() == 100000 for 20000 times, dtype=torch.uint8
1.059969027999614
```
After (timeit totals, in seconds):
```
a.bitwise_and_(b), numel() == 10000 for 200000 times, dtype=torch.int8
0.9562859639991075
a.bitwise_and_(b), numel() == 100000 for 20000 times, dtype=torch.int8
0.6811799210008758
a.bitwise_and_(b), numel() == 10000 for 200000 times, dtype=torch.uint8
0.9522694869992847
a.bitwise_and_(b), numel() == 100000 for 20000 times, dtype=torch.uint8
0.6815469840003061
a.bitwise_or_(b), numel() == 10000 for 200000 times, dtype=torch.int8
0.8609786279994296
a.bitwise_or_(b), numel() == 100000 for 20000 times, dtype=torch.int8
0.5794818879985542
a.bitwise_or_(b), numel() == 10000 for 200000 times, dtype=torch.uint8
0.8534434389985108
a.bitwise_or_(b), numel() == 100000 for 20000 times, dtype=torch.uint8
0.5764101290005783
a.bitwise_xor_(b), numel() == 10000 for 200000 times, dtype=torch.int8
0.9634105910008657
a.bitwise_xor_(b), numel() == 100000 for 20000 times, dtype=torch.int8
0.6819724230008433
a.bitwise_xor_(b), numel() == 10000 for 200000 times, dtype=torch.uint8
1.0901075929978106
a.bitwise_xor_(b), numel() == 100000 for 20000 times, dtype=torch.uint8
0.816546294001455
```
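To make the Before/After listings easier to compare, the sketch below computes the speedup (before time divided by after time) for each configuration. The timings are copied from the listings above; the `timings` and `speedups` names are only illustrative and are not part of the PR.

```python
# Timings in seconds, copied verbatim from the "Before" and "After" listings.
# Key: (op, dtype, n). Value: (before, after).
timings = {
    ("bitwise_and", "int8", 10_000): (1.353799690001324, 0.9562859639991075),
    ("bitwise_and", "int8", 100_000): (1.056434961999912, 0.6811799210008758),
    ("bitwise_and", "uint8", 10_000): (1.2957618809996347, 0.9522694869992847),
    ("bitwise_and", "uint8", 100_000): (1.0591609650000464, 0.6815469840003061),
    ("bitwise_or", "int8", 10_000): (1.3113185389993305, 0.8609786279994296),
    ("bitwise_or", "int8", 100_000): (1.0693870880022587, 0.5794818879985542),
    ("bitwise_or", "uint8", 10_000): (1.3075691039994126, 0.8534434389985108),
    ("bitwise_or", "uint8", 100_000): (1.0589785859992844, 0.5764101290005783),
    ("bitwise_xor", "int8", 10_000): (1.3036618039986934, 0.9634105910008657),
    ("bitwise_xor", "int8", 100_000): (1.0595013140009542, 0.6819724230008433),
    ("bitwise_xor", "uint8", 10_000): (1.2947387999993225, 1.0901075929978106),
    ("bitwise_xor", "uint8", 100_000): (1.059969027999614, 0.816546294001455),
}

# Speedup > 1 means the "After" build finished the same work faster.
speedups = {key: before / after for key, (before, after) in timings.items()}

for (op, dtype, n), ratio in speedups.items():
    print(f"{op:12s} {dtype:6s} n={n:>7d}  {ratio:.2f}x")
```

On these numbers every configuration is faster after the change, with ratios ranging from roughly 1.2x to 1.8x. These are single runs on one machine, so the ratios are indicative rather than precise.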
Test Plan: Imported from OSS
Differential Revision: D20687081
Pulled By: ezyang
fbshipit-source-id: 59b06460430ce181fb761e45a5bdd6379611b391