pytorch
5cf64415 - Speed up fill for half and bfloat16 on CPU. (#28397)

Commit
5 years ago
Summary: This is done by replacing Vec<uint16_t> with Vec<int16_t>, which has all sorts of AVX optimization available.

Benchmark (Debian 10, Release build, gcc 8.3, no turbo, Intel(R) Xeon(R) E-2136):

```python
import timeit
for dtype in ('torch.bfloat16', 'torch.half'):
    for n, t in [(40_000, 600000), (400_000, 60000)]:
        print(f'a.fill_(10) for {t} times, a=torch.empty({n}, dtype={dtype})')
        print(timeit.timeit(f'a.fill_(10)', setup=f'import torch; a=torch.empty({n}, dtype={dtype})', number=t))
```

Before:

```
a.fill_(10) for 600000 times, a=torch.empty(40000, dtype=torch.bfloat16)
11.064065577999827
a.fill_(10) for 60000 times, a=torch.empty(400000, dtype=torch.bfloat16)
10.618151295000189
a.fill_(10) for 600000 times, a=torch.empty(40000, dtype=torch.half)
10.989039544000207
a.fill_(10) for 60000 times, a=torch.empty(400000, dtype=torch.half)
10.602233665999847
```

After:

```
a.fill_(10) for 600000 times, a=torch.empty(40000, dtype=torch.bfloat16)
1.530125006000162
a.fill_(10) for 60000 times, a=torch.empty(400000, dtype=torch.bfloat16)
1.4807136570002513
a.fill_(10) for 600000 times, a=torch.empty(40000, dtype=torch.half)
1.3946152990001792
a.fill_(10) for 60000 times, a=torch.empty(400000, dtype=torch.half)
1.457788402999995
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/28397

Differential Revision: D18125171

Pulled By: ezyang

fbshipit-source-id: bfb2da13f10bc582e9848073e428af9e36656b13
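Why swapping the unsigned lane type for a signed one can be safe for fill: a fill only copies one 16-bit pattern into every element, so the signedness of the integer type used to carry that pattern should not affect the bytes written. The sketch below illustrates this with the Python standard library only. It is not PyTorch's implementation, and the helper names (`half_bits`, `bfloat16_bits`, `to_signed16`, `fill_as_int16`, `fill_as_uint16`) are hypothetical, introduced here for illustration.

```python
import struct
from array import array


def half_bits(value: float) -> int:
    """Bit pattern of `value` as IEEE 754 binary16, returned as an unsigned int."""
    (bits,) = struct.unpack("<H", struct.pack("<e", value))
    return bits


def bfloat16_bits(value: float) -> int:
    """Bit pattern of `value` as bfloat16, obtained by truncating float32.

    Assumption: truncation is used for simplicity; it is exact for values
    such as 10.0 whose low 16 float32 bits are zero.
    """
    (bits32,) = struct.unpack("<I", struct.pack("<f", value))
    return bits32 >> 16


def to_signed16(bits: int) -> int:
    """Reinterpret an unsigned 16-bit pattern as a signed 16-bit integer."""
    return bits - 0x10000 if bits >= 0x8000 else bits


def fill_as_int16(bits: int, n: int) -> bytes:
    """Fill n elements through a signed 16-bit buffer (native byte order)."""
    return array("h", [to_signed16(bits)] * n).tobytes()


def fill_as_uint16(bits: int, n: int) -> bytes:
    """Fill n elements through an unsigned 16-bit buffer (native byte order)."""
    return array("H", [bits] * n).tobytes()


if __name__ == "__main__":
    pattern = half_bits(10.0)
    buf = fill_as_int16(pattern, 4)
    # Both buffers hold the same bytes, and both decode back to half floats.
    print(buf == fill_as_uint16(pattern, 4))
    print(struct.unpack("=4e", buf))
```

The same reasoning applies to bfloat16, since it is also a 16-bit format; only the bit pattern being broadcast differs.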