[PyTorch] Save a single add instruction in the dispatcher (#52543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52543
This saves one (1) add instruction in the dispatcher. The new code
comments explain exactly why. In short, we store a direct pointer to the
`OperatorDef` in `OperatorHandle` in addition to the
`std::list<OperatorDef>::iterator`, because converting the latter to the
former requires an add instruction.
It is not clear to me whether this is a particularly great tradeoff,
but I spent (more) time on it (than I expected), so here it is for
review.
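For reviewers who want the idea without reading the diff, here is a minimal sketch of the tradeoff. It is not the actual dispatcher code: `Handle`, the simplified `OperatorDef`, and `handle_example` are hypothetical stand-ins. The assumption is that, as in common standard library implementations, a `std::list` iterator holds a pointer to the list node and the element lives at a fixed offset past the node's link pointers, so `&*it` costs an add. Caching the element pointer avoids that on the hot path at the price of one extra pointer per handle.

```cpp
#include <cassert>
#include <list>
#include <string>

// Hypothetical, simplified stand-in for the dispatcher's per-operator record.
struct OperatorDef {
  std::string name;
  int call_count;
};

// Hypothetical handle that keeps both the list iterator and a direct pointer.
class Handle {
 public:
  explicit Handle(std::list<OperatorDef>::iterator it)
      : def_(&*it),  // pay the iterator-to-pointer conversion once, here
        it_(it) {}

  // Hot path: a plain load of the cached pointer, no offset arithmetic.
  OperatorDef& def() const { return *def_; }

  // Cold path: the iterator is still kept, e.g. for erasing from the list.
  std::list<OperatorDef>::iterator iterator() const { return it_; }

 private:
  OperatorDef* def_;
  std::list<OperatorDef>::iterator it_;
};

// Small usage example. std::list never relocates existing elements on
// insertion, so the cached pointer stays valid as the list grows.
inline void handle_example() {
  std::list<OperatorDef> ops;
  ops.push_back(OperatorDef{"aten::empty", 0});
  Handle h(ops.begin());
  ops.push_front(OperatorDef{"aten::add", 0});
  h.def().call_count += 1;
  assert(&h.def() == &*h.iterator());
  assert(h.iterator()->call_count == 1);
}
```

The cached pointer is only safe because `std::list` guarantees that pointers and iterators to an element stay valid until that element is erased. The same trick would not work with a container that relocates its elements.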
ghstack-source-id: 122147199
Test Plan:
Inspect the assembly for at::empty in the benchmark code and confirm
that the add instruction has disappeared.
Compare empty benchmark performance to baseline with perf stat.
Baseline:
```
5,077.43 msec task-clock # 1.000 CPUs utilized ( +- 0.25% )
405 context-switches # 0.080 K/sec ( +- 1.37% )
3 cpu-migrations # 0.001 K/sec ( +- 18.22% )
12,259 page-faults # 0.002 M/sec ( +- 0.10% )
10,089,754,343 cycles # 1.987 GHz ( +- 0.25% ) (50.04%)
29,516,000,227 instructions # 2.93 insn per cycle ( +- 0.04% ) (50.08%)
5,662,629,032 branches # 1115.256 M/sec ( +- 0.02% ) (50.08%)
1,955,729 branch-misses # 0.03% of all branches ( +- 0.88% ) (50.04%)
5.0796 +- 0.0128 seconds time elapsed ( +- 0.25% )
```
After:
```
5,017.77 msec task-clock # 1.001 CPUs utilized ( +- 0.19% )
400 context-switches # 0.080 K/sec ( +- 3.09% )
4 cpu-migrations # 0.001 K/sec ( +- 46.91% )
12,240 page-faults # 0.002 M/sec ( +- 0.37% )
9,960,189,535 cycles # 1.985 GHz ( +- 0.19% ) (50.02%)
29,467,149,773 instructions # 2.96 insn per cycle ( +- 0.11% ) (50.03%)
5,661,074,219 branches # 1128.206 M/sec ( +- 0.02% ) (50.07%)
2,032,712 branch-misses # 0.04% of all branches ( +- 1.35% ) (50.07%)
5.0151 +- 0.0101 seconds time elapsed ( +- 0.20% )
```
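About 1.3% cycles win (10,089,754,343 -> 9,960,189,535), outside the noise (+- 0.25% baseline, +- 0.19% after)
About 0.17% instruction count win (29,516,000,227 -> 29,467,149,773), barely outside the noise (+- 0.04% baseline, +- 0.11% after)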
I am surprised at the size of the cycles win.
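For anyone who wants to recheck the win percentages, here is a small sketch that recomputes them from the raw perf stat counters quoted in this message. The helper `percent_reduction` and the constant names are made up for illustration and are not part of the benchmark code.

```cpp
#include <cassert>
#include <cstdint>

// Relative reduction from `before` to `after`, expressed in percent.
inline double percent_reduction(std::uint64_t before, std::uint64_t after) {
  assert(before != 0);
  return 100.0 *
         (static_cast<double>(before) - static_cast<double>(after)) /
         static_cast<double>(before);
}

// Counters copied from the perf stat output in this message.
const std::uint64_t kBaselineCycles = 10089754343ULL;
const std::uint64_t kAfterCycles = 9960189535ULL;
const std::uint64_t kBaselineInstructions = 29516000227ULL;
const std::uint64_t kAfterInstructions = 29467149773ULL;

// Cycles win, in percent.
inline double cycles_win() {
  return percent_reduction(kBaselineCycles, kAfterCycles);
}

// Instruction count win, in percent.
inline double instructions_win() {
  return percent_reduction(kBaselineInstructions, kAfterInstructions);
}
```

Comparing each result against the sum of the two reported `+-` percentages for that counter is a rough way to judge whether the difference is outside the noise.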
Reviewed By: bhosmer
Differential Revision: D26564192
fbshipit-source-id: 71f731ba54ec1cb407673db691eaf77a257de4a9