pytorch
2b02d154 - Implement fast pass for CPU scalars /number literals (#29915)

Commit
5 years ago
Implement fast pass for CPU scalars / number literals (#29915)

Summary: The main changes in this PR are:

- Skip device dispatch for CPU scalars (number literals also fall into this category). In most cases scalars should be on CPU for best performance, but if users explicitly put them on another device, we respect that setting and exit the fast pass.
- Directly manipulate the Tensor `data_ptr` when filling a scalar into a 1-element tensor.

Some perf benchmark numbers:

```
## Before
In [4]: def test(x):
   ...:     x = x + 2
   ...:     return x
   ...:

In [5]: with torch.no_grad():
   ...:     x = torch.ones(100)
   ...:     %timeit {test(x)}
   ...:
79.8 µs ± 127 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

## After
In [2]: def test(x):
   ...:     x = x + 2
   ...:     return x
   ...:

In [3]: with torch.no_grad():
   ...:     x = torch.ones(100)
   ...:     %timeit {test(x)}
   ...:
60.5 µs ± 334 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

Before the patch, `tensor_slow` took 15.74% of total time.
<img width="1186" alt="Screen Shot 2019-11-15 at 12 49 51 PM" src="https://user-images.githubusercontent.com/5248122/68976895-cc808c00-07ab-11ea-8f3c-7f15597d12cf.png">

After the patch, `tensor_slow` takes 3.84% of total time.
<img width="1190" alt="Screen Shot 2019-11-15 at 1 13 03 PM" src="https://user-images.githubusercontent.com/5248122/68976925-e28e4c80-07ab-11ea-94c0-91172fc3bb53.png">

cc: roosephu, who originally reported this issue to me.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/29915
Differential Revision: D18584251
Pulled By: ailzhang
fbshipit-source-id: 2353c8012450a81872e1e09717b3b181362be401
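To illustrate the control flow the summary describes, here is a hedged sketch in plain Python. This is not PyTorch's actual C++ implementation; the names `wrap_scalar`, `fill_cpu_scalar`, `dispatch_slow`, and `scalar_to_tensor` are all hypothetical stand-ins, used only to model the decision "take the fast pass for CPU scalars, fall back to full device dispatch otherwise":

```python
# Toy model of the fast-pass decision (hypothetical names, not PyTorch code).

def wrap_scalar(value, device="cpu"):
    """A Python number literal becomes a scalar on CPU by default."""
    return {"device": device, "value": value}

def fill_cpu_scalar(scalar):
    """Fast pass: write the value straight into a 1-element buffer,
    mimicking the direct data_ptr fill with no dispatcher round-trip."""
    storage = [None]  # stand-in for a 1-element tensor's storage
    storage[0] = scalar["value"]
    return storage

def dispatch_slow(scalar):
    """Stand-in for the full device-dispatch path (tensor_slow)."""
    return [scalar["value"]]

def scalar_to_tensor(scalar):
    # Take the fast pass only for CPU scalars; respect an explicitly
    # chosen non-CPU device by exiting to the slow path.
    if scalar["device"] == "cpu":
        return fill_cpu_scalar(scalar)
    return dispatch_slow(scalar)
```

In this model, the literal `2` in `x + 2` would flow through `wrap_scalar(2)` and take the CPU fast pass, which is why the benchmark above spends far less time in `tensor_slow` after the patch.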
Author
Ailing Zhang