[PyTorch] Pass TensorOptions by value (#51165)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51165
`TensorOptions` has no non-trivial copy, move, or destroy
operation and is small enough to fit in a register, so it
seems like we should pass it by value instead of by reference.
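As a rough illustration of that reasoning (not the actual `TensorOptions` definition), the sketch below uses a hypothetical `SmallOptions` struct and checks at compile time the properties the argument depends on. The struct, its fields, and both function names are made up for this example. The 8-byte bound assumes a 64-bit target.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for a small options type; NOT the real c10::TensorOptions.
struct SmallOptions {
  std::int8_t dtype = 0;
  std::int8_t layout = 0;
  std::int16_t device_index = -1;
  bool requires_grad = false;
};

// The properties that make pass-by-value attractive, verified at compile time.
static_assert(std::is_trivially_copyable<SmallOptions>::value,
              "copy and move must be trivial");
static_assert(std::is_trivially_destructible<SmallOptions>::value,
              "destruction must be trivial");
static_assert(sizeof(SmallOptions) <= 8,
              "assumed to fit in one 64-bit register");

// By reference: the callee receives an address and loads fields through it.
bool needs_grad_by_ref(const SmallOptions& options) {
  return options.requires_grad;
}

// By value: on common 64-bit ABIs a small trivial struct like this can be
// passed directly in a register, so no indirection is required.
bool needs_grad_by_value(SmallOptions options) {
  return options.requires_grad;
}
```

Both functions return the same result for the same input; any difference would be in the generated code, which depends on the compiler and the platform ABI.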
ghstack-source-id: 120697498
Test Plan:
Measured timing for the empty framework overhead benchmark before and after this change:
Before:
```
I0126 16:02:50.662864 2137574 bench.cpp:139] Mean 0.268645
I0126 16:02:50.662891 2137574 bench.cpp:140] Median 0.267485
I0126 16:02:50.662896 2137574 bench.cpp:141] Min 0.266485
I0126 16:02:50.662901 2137574 bench.cpp:142] stddev 0.00219359
I0126 16:02:50.662915 2137574 bench.cpp:143] stddev / mean 0.00816537
2,968.37 msec task-clock # 0.997 CPUs utilized ( +- 0.03% )
250 context-switches # 0.084 K/sec ( +- 2.21% )
1 cpu-migrations # 0.000 K/sec
11,403 page-faults # 0.004 M/sec ( +- 0.28% )
5,898,481,882 cycles # 1.987 GHz ( +- 0.03% ) (50.05%)
16,169,242,938 instructions # 2.74 insn per cycle ( +- 0.03% ) (50.06%)
3,076,546,626 branches # 1036.443 M/sec ( +- 0.05% ) (50.05%)
2,531,859 branch-misses # 0.08% of all branches ( +- 0.89% ) (50.03%)
```
After:
```
I0126 16:23:20.010062 2244624 bench.cpp:139] Mean 0.266814
I0126 16:23:20.010092 2244624 bench.cpp:140] Median 0.265759
I0126 16:23:20.010099 2244624 bench.cpp:141] Min 0.260291
I0126 16:23:20.010107 2244624 bench.cpp:142] stddev 0.00548279
I0126 16:23:20.010118 2244624 bench.cpp:143] stddev / mean 0.0205491
2,983.75 msec task-clock # 0.995 CPUs utilized ( +- 0.36% )
243 context-switches # 0.082 K/sec ( +- 1.26% )
1 cpu-migrations # 0.000 K/sec
11,422 page-faults # 0.004 M/sec ( +- 0.18% )
5,928,639,486 cycles # 1.987 GHz ( +- 0.36% ) (50.02%)
16,105,928,210 instructions # 2.72 insn per cycle ( +- 0.05% ) (50.02%)
3,150,273,453 branches # 1055.809 M/sec ( +- 0.03% ) (50.05%)
3,713,617 branch-misses # 0.12% of all branches ( +- 0.83% ) (50.07%)
```
The timing looked close to neutral, so I used `perf stat` to confirm it's a small (under 1%) instruction count win.
To decide whether this stack is worth it, I went back and ran `perf stat` on the baseline diff, from before I started touching the dispatcher:
```
2,968.37 msec task-clock # 0.997 CPUs utilized ( +- 0.03% )
250 context-switches # 0.084 K/sec ( +- 2.21% )
1 cpu-migrations # 0.000 K/sec
11,403 page-faults # 0.004 M/sec ( +- 0.28% )
5,898,481,882 cycles # 1.987 GHz ( +- 0.03% ) (50.05%)
16,169,242,938 instructions # 2.74 insn per cycle ( +- 0.03% ) (50.06%)
3,076,546,626 branches # 1036.443 M/sec ( +- 0.05% ) (50.05%)
2,531,859 branch-misses # 0.08% of all branches ( +- 0.89% ) (50.03%)
```
If I've done the arithmetic correctly, we have a 0.39% instruction count win (16,169,242,938 instructions down to 16,105,928,210).
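For anyone re-checking that arithmetic, here is a minimal sketch that recomputes the relative reduction from the two `instructions` counters in the `perf stat` output above. The helper `relative_reduction` and the constant names are made up for this example and are not part of the benchmark.

```cpp
#include <cstdint>

// Fractional reduction going from `before` to `after`
// (positive means `after` is smaller).
constexpr double relative_reduction(std::uint64_t before, std::uint64_t after) {
  return (static_cast<double>(before) - static_cast<double>(after)) /
         static_cast<double>(before);
}

// Instruction counts copied from the perf stat output in this message.
constexpr std::uint64_t kInstructionsBefore = 16169242938ULL;
constexpr std::uint64_t kInstructionsAfter = 16105928210ULL;

constexpr double kInstructionWin =
    relative_reduction(kInstructionsBefore, kInstructionsAfter);

// 63,314,728 fewer instructions out of 16,169,242,938 is roughly 0.39%.
static_assert(kInstructionWin > 0.0039 && kInstructionWin < 0.0040,
              "expected a win of about 0.39%");
```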
Reviewed By: ezyang
Differential Revision: D25983863
fbshipit-source-id: 87d1451a01ead25738ea6b80db270d344bc583b2