Vectorize int8_t on CPU (#44759)
Summary:
int8_t is not vectorized in vec256_int.h. This PR adds vectorization for
int8_t. As pointed out in https://github.com/pytorch/pytorch/issues/43033, this is an important type to vectorize
because many images are loaded in this data type.
Related issue: https://github.com/pytorch/pytorch/issues/43033
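For context, vectorizing int8_t on AVX2 means operating on 32 bytes per instruction instead of one element at a time. The snippet below is a minimal illustrative sketch of that idea (element-wise int8 addition via _mm256_add_epi8 with a scalar tail); the function name add_int8_avx2 is invented for this example, and it is not the PR's actual Vec256<int8_t> implementation:
```cpp
// Illustrative sketch only (assumes AVX2; compile with -mavx2).
// Not the PR's code: add_int8_avx2 is a hypothetical helper for this example.
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Element-wise out[i] = a[i] + b[i] for int8_t, 32 lanes per AVX2 register.
void add_int8_avx2(const int8_t* a, const int8_t* b, int8_t* out, std::size_t n) {
  std::size_t i = 0;
  for (; i + 32 <= n; i += 32) {
    __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
    __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
    __m256i vc = _mm256_add_epi8(va, vb);  // 32 int8 additions in one instruction
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i), vc);
  }
  // Scalar tail for the remaining (n % 32) elements.
  for (; i < n; ++i) {
    out[i] = static_cast<int8_t>(a[i] + b[i]);
  }
}
```
Subtraction follows the same pattern with _mm256_sub_epi8; the actual Vec256 wrapper exposes these operations through operator overloads used by the TensorIterator-based kernels.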
Benchmark (Debian Buster, Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz, Turbo off, Release build):
```python
import timeit
dtype = 'torch.int8'
for op in ('+', '-'):
    for n, t in [(10_000, 200000),
                 (100_000, 20000)]:
        print(f'a {op} b, numel() == {n} for {t} times, dtype={dtype}')
        print(timeit.timeit(f'c = a {op} b',
                            setup=f'import torch; a = torch.arange(1, {n}, dtype={dtype}); b = torch.arange({n}, 1, -1, dtype={dtype})',
                            number=t))
```
Results:
Before:
```
a + b, numel() == 10000 for 200000 times, dtype=torch.int8
1.2223373489978258
a + b, numel() == 100000 for 20000 times, dtype=torch.int8
0.6108450189931318
a - b, numel() == 10000 for 200000 times, dtype=torch.int8
1.256775538000511
a - b, numel() == 100000 for 20000 times, dtype=torch.int8
0.6101213909860235
```
After (roughly 1.5-2.2x faster):
```
a + b, numel() == 10000 for 200000 times, dtype=torch.int8
0.5713336059998255
a + b, numel() == 100000 for 20000 times, dtype=torch.int8
0.39169703199877404
a - b, numel() == 10000 for 200000 times, dtype=torch.int8
0.5838428330025636
a - b, numel() == 100000 for 20000 times, dtype=torch.int8
0.37486923701362684
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44759
Reviewed By: malfet
Differential Revision: D23786383
Pulled By: glaringlee
fbshipit-source-id: 67f5bcd344c0b5014bacbc876143231fca156713