14acf92b - [PyTorch] Speed up Tensor::data_ptr (#53723)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53723

We know the size of the data item at compile time. Let's take advantage of that instead of doing a runtime multiplication by the data type size. (Presumably, constant-propagating through `data_type.itemsize()` to optimize the `imul` away was just a bridge too far for clang -- I checked the assembly and we went from a load-and-`imul` to a `lea` that multiplies by a constant 4 for `data_ptr<float>()`.)

ghstack-source-id: 123559924

Test Plan: Compared `perf stat` output for the Mergenet AdIndexer benchmark before/after this change:

Before:
```
         16,943.46 msec task-clock         #    0.999 CPUs utilized            ( +-  0.16% )
             3,771      context-switches   #    0.223 K/sec                    ( +- 15.86% )
                 3      cpu-migrations     #    0.000 K/sec
           101,660      page-faults        #    0.006 M/sec                    ( +-  1.00% )
    33,519,516,740      cycles             #    1.978 GHz                      ( +-  0.20% )  (49.99%)
    68,556,471,199      instructions       #    2.05  insn per cycle           ( +-  0.08% )  (49.98%)
    11,100,415,689      branches           #  655.145 M/sec                    ( +-  0.12% )  (50.02%)
        73,082,369      branch-misses      #    0.66% of all branches          ( +-  0.45% )  (50.01%)
```

After:
```
         16,779.99 msec task-clock         #    0.999 CPUs utilized            ( +-  0.40% )
             2,815      context-switches   #    0.168 K/sec                    ( +-  7.89% )
                 3      cpu-migrations     #    0.000 K/sec                    ( +-  6.25% )
           100,124      page-faults        #    0.006 M/sec                    ( +-  0.40% )
    33,213,000,715      cycles             #    1.979 GHz                      ( +-  0.39% )  (49.99%)
    68,359,270,731      instructions       #    2.06  insn per cycle           ( +-  0.08% )  (50.00%)
    11,058,104,630      branches           #  659.005 M/sec                    ( +-  0.11% )  (50.00%)
        72,840,016      branch-misses      #    0.66% of all branches          ( +-  0.51% )  (49.99%)
```

A 0.9% cycles win and a 0.29% instruction count win, both of which seem to be outside the noise.

Reviewed By: bhosmer

Differential Revision: D26919110

fbshipit-source-id: 23fab7adcfcf6ec9c87ebfb5d5304b6f155f577f
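To make the optimization concrete, here is a minimal C++ sketch of the general technique, not PyTorch's actual classes: `StorageSketch` and both function names are illustrative stand-ins. Computing the element address via a runtime `itemsize` forces the compiler to emit a load and an `imul`, whereas typed pointer arithmetic bakes `sizeof(T)` into the address calculation at compile time.

```
#include <cstddef>

// Hypothetical stand-in for a tensor's storage metadata; the real
// PyTorch types differ.
struct StorageSketch {
  void* data;            // base allocation
  std::size_t itemsize;  // element size, known only at runtime
  std::size_t offset;    // element offset into the allocation
};

// Before: byte arithmetic with a runtime element size. The compiler
// has to load `itemsize` and multiply (`imul`) at every call.
void* data_ptr_runtime(const StorageSketch& s) {
  return static_cast<char*>(s.data) + s.itemsize * s.offset;
}

// After: typed pointer arithmetic. sizeof(T) is a compile-time
// constant, so the scaling folds into the addressing mode (e.g. a
// single `lea` with scale 4 when T is float).
template <typename T>
T* data_ptr_compiletime(const StorageSketch& s) {
  return static_cast<T*>(s.data) + s.offset;
}
```

Under optimization, `data_ptr_compiletime<float>` should reduce to one scaled address computation, matching the load-and-`imul` to `lea` change described in the summary.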