Speed up fill for half and bfloat16 on CPU. (#28397)
Summary:
This is done by replacing Vec<uint16_t> with Vec<int16_t>, which has AVX-optimized specializations available.
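The gist of the approach, as a minimal standalone sketch (the type and function names below are illustrative, not the actual ATen code): since half and bfloat16 are stored as raw 16-bit payloads, a fill can broadcast the bit pattern through an int16_t view, letting the integer SIMD path (or the autovectorizer) do the work instead of a scalar per-element conversion loop.
```cpp
// Hypothetical sketch of filling a 16-bit float buffer via its int16_t bit pattern.
#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

// Stand-in for a 16-bit float type that stores raw bits (like at::BFloat16).
struct BFloat16Like {
  uint16_t bits;
};

// Broadcast the value's bit pattern through an int16_t pointer; a real kernel
// would hand this range to an AVX-specialized Vec<int16_t> instead of relying
// on the compiler to vectorize the loop.
void fill_as_int16(BFloat16Like* data, std::size_t n, BFloat16Like value) {
  int16_t pattern;
  std::memcpy(&pattern, &value.bits, sizeof(pattern));
  int16_t* p = reinterpret_cast<int16_t*>(data);
  for (std::size_t i = 0; i < n; ++i) {
    p[i] = pattern;  // plain 16-bit integer store, trivially vectorizable
  }
}

int main() {
  std::vector<BFloat16Like> buf(8);
  // 0x4120 is the bfloat16 bit pattern of 10.0f (top 16 bits of 0x41200000).
  fill_as_int16(buf.data(), buf.size(), BFloat16Like{0x4120});
  std::cout << std::hex << buf[3].bits << "\n";  // prints 4120
}
```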
Benchmark (Debian 10, Release build, gcc 8.3, no turbo, Intel(R) Xeon(R) E-2136):
```python
import timeit

for dtype in ('torch.bfloat16', 'torch.half'):
    for n, t in [(40_000, 600_000),
                 (400_000, 60_000)]:
        print(f'a.fill_(10) for {t} times, a=torch.empty({n}, dtype={dtype})')
        print(timeit.timeit('a.fill_(10)',
                            setup=f'import torch; a=torch.empty({n}, dtype={dtype})',
                            number=t))
```
Before:
```
a.fill_(10) for 600000 times, a=torch.empty(40000, dtype=torch.bfloat16)
11.064065577999827
a.fill_(10) for 60000 times, a=torch.empty(400000, dtype=torch.bfloat16)
10.618151295000189
a.fill_(10) for 600000 times, a=torch.empty(40000, dtype=torch.half)
10.989039544000207
a.fill_(10) for 60000 times, a=torch.empty(400000, dtype=torch.half)
10.602233665999847
```
After:
```
a.fill_(10) for 600000 times, a=torch.empty(40000, dtype=torch.bfloat16)
1.530125006000162
a.fill_(10) for 60000 times, a=torch.empty(400000, dtype=torch.bfloat16)
1.4807136570002513
a.fill_(10) for 600000 times, a=torch.empty(40000, dtype=torch.half)
1.3946152990001792
a.fill_(10) for 60000 times, a=torch.empty(400000, dtype=torch.half)
1.457788402999995
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28397
Differential Revision: D18125171
Pulled By: ezyang
fbshipit-source-id: bfb2da13f10bc582e9848073e428af9e36656b13