[PyTorch] Speed up Tensor::data_ptr (#53723)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53723
We know the size of the data item at compile time, so take
advantage of that instead of doing a runtime multiplication by the
data type's size. (Presumably, constant-propagating through
`data_type.itemsize()` to optimize the `imul` away was a bridge too
far for clang -- I checked the assembly, and we went from a
load-and-`imul` to a `lea` that multiplies by the constant 4 for
`data_ptr<float>()`.)
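The gist of the change, as a minimal sketch (the struct and member
names below are illustrative, not the actual `TensorImpl` code):
```
#include <cstddef>

// Hypothetical stand-in for the relevant TensorImpl state.
struct ToyTensorImpl {
  void* storage_data;             // base pointer of the storage
  std::size_t itemsize;           // runtime dtype size, e.g. data_type.itemsize()
  std::ptrdiff_t storage_offset;  // offset, counted in elements

  // Before: the element size is a runtime value, so the compiler
  // emits a load of `itemsize` plus an `imul` to scale the offset.
  void* data_ptr_runtime() const {
    return static_cast<char*>(storage_data) + itemsize * storage_offset;
  }

  // After: T is known at compile time, so pointer arithmetic on T*
  // bakes sizeof(T) into the address computation (e.g. a single
  // `lea` scaling by 4 when T = float).
  template <typename T>
  T* data_ptr_static() const {
    return static_cast<T*>(storage_data) + storage_offset;
  }
};
```
The compiler can't recover this on its own because `itemsize` is
fetched through a runtime data structure and so isn't provably
constant at the call site, whereas `sizeof(T)` is a compile-time
constant.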
ghstack-source-id: 123559924
Test Plan:
Compared `perf stat` output for the Mergenet AdIndexer
benchmark before/after this change:
Before:
```
      16,943.46 msec task-clock          #    0.999 CPUs utilized          ( +-  0.16% )
          3,771      context-switches    #    0.223 K/sec                  ( +- 15.86% )
              3      cpu-migrations      #    0.000 K/sec
        101,660      page-faults         #    0.006 M/sec                  ( +-  1.00% )
 33,519,516,740      cycles              #    1.978 GHz                    ( +-  0.20% )  (49.99%)
 68,556,471,199      instructions        #    2.05  insn per cycle         ( +-  0.08% )  (49.98%)
 11,100,415,689      branches            #  655.145 M/sec                  ( +-  0.12% )  (50.02%)
     73,082,369      branch-misses       #    0.66% of all branches        ( +-  0.45% )  (50.01%)
```
After:
```
      16,779.99 msec task-clock          #    0.999 CPUs utilized          ( +-  0.40% )
          2,815      context-switches    #    0.168 K/sec                  ( +-  7.89% )
              3      cpu-migrations      #    0.000 K/sec                  ( +-  6.25% )
        100,124      page-faults         #    0.006 M/sec                  ( +-  0.40% )
 33,213,000,715      cycles              #    1.979 GHz                    ( +-  0.39% )  (49.99%)
 68,359,270,731      instructions        #    2.06  insn per cycle         ( +-  0.08% )  (50.00%)
 11,058,104,630      branches            #  659.005 M/sec                  ( +-  0.11% )  (50.00%)
     72,840,016      branch-misses       #    0.66% of all branches        ( +-  0.51% )  (49.99%)
```
That's a 0.9% win in cycles and a 0.29% win in instruction count,
both of which appear to be outside the run-to-run noise.
Reviewed By: bhosmer
Differential Revision: D26919110
fbshipit-source-id: 23fab7adcfcf6ec9c87ebfb5d5304b6f155f577f