[PyTorch] Migrate add operators to borrow in TensorIteratorBase (#55691)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55691
Borrowing the operands instead of taking owning references avoids
reference counting for the add operators, which is roughly a 5% CPU
time win vs. not supporting borrowing at all.
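For readers unfamiliar with the idea, here is a minimal sketch of why borrowing avoids reference counting. It is an illustration only, not PyTorch's implementation: `Handle` and `MaybeOwnedHandle` are hypothetical names, and `std::shared_ptr` stands in for a reference-counted tensor handle.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a reference-counted tensor handle.
using Handle = std::shared_ptr<int>;

// Illustrative wrapper that either shares ownership of a handle or merely
// borrows the caller's handle. Names are assumptions made for this sketch.
class MaybeOwnedHandle {
 public:
  // Owning mode: copies the handle, which increments the reference count
  // now and decrements it again when this wrapper is destroyed.
  static MaybeOwnedHandle owned(const Handle& h) {
    MaybeOwnedHandle m;
    m.owned_ = h;
    return m;
  }

  // Borrowing mode: stores only a pointer to the caller's handle, so the
  // reference count is never touched. The caller must keep `h` alive for
  // as long as the wrapper is in use.
  static MaybeOwnedHandle borrowed(const Handle& h) {
    MaybeOwnedHandle m;
    m.borrowed_ = &h;
    return m;
  }

  const Handle& get() const {
    assert(borrowed_ != nullptr || owned_ != nullptr);
    return borrowed_ != nullptr ? *borrowed_ : owned_;
  }

 private:
  MaybeOwnedHandle() = default;
  Handle owned_;                      // set only in owning mode
  const Handle* borrowed_ = nullptr;  // set only in borrowing mode
};
```

In this sketch, `MaybeOwnedHandle::owned(t)` raises `t.use_count()` by one for the wrapper's lifetime, while `MaybeOwnedHandle::borrowed(t)` leaves it unchanged. The trade-off is that a borrowed handle is only valid while the original outlives it.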
ghstack-source-id: 127092680
Test Plan:
Existing CI for correctness.
Continued perf stat experiment from previous diff. All results included below for reviewing convenience.
Baseline:
```
Performance counter stats for '/tmp/cpp_benchmark.MaybeOwnedBaselineD27607270' (5 runs):
5,837.13 msec task-clock # 1.000 CPUs utilized ( +- 0.34% )
442 context-switches # 0.076 K/sec ( +- 3.54% )
5 cpu-migrations # 0.001 K/sec ( +- 19.07% )
13,144 page-faults # 0.002 M/sec ( +- 0.39% )
11,597,542,455 cycles # 1.987 GHz ( +- 0.32% ) (50.05%)
30,687,118,071 instructions # 2.65 insn per cycle ( +- 0.03% ) (50.08%)
6,247,677,215 branches # 1070.334 M/sec ( +- 0.04% ) (50.08%)
1,705,403 branch-misses # 0.03% of all branches ( +- 2.16% ) (50.05%)
# Table of individual measurements:
5.9025 (+0.0663) #
5.8276 (-0.0085) #
5.8151 (-0.0210) #
5.7842 (-0.0519) #
5.8511 (+0.0150) #
# Final result:
5.8361 +- 0.0198 seconds time elapsed ( +- 0.34% )
```
Add borrowing support but don't use it (previous diff):
```
Performance counter stats for '/tmp/cpp_benchmark.MeasureMaybeOwnedCost' (5 runs):
5,947.20 msec task-clock # 0.999 CPUs utilized ( +- 0.15% )
422 context-switches # 0.071 K/sec ( +- 1.88% )
3 cpu-migrations # 0.001 K/sec ( +- 47.14% )
13,025 page-faults # 0.002 M/sec ( +- 0.46% )
11,814,216,945 cycles # 1.987 GHz ( +- 0.12% ) (50.08%)
31,535,372,676 instructions # 2.67 insn per cycle ( +- 0.06% ) (50.09%)
6,482,809,438 branches # 1090.060 M/sec ( +- 0.04% ) (50.07%)
1,688,623 branch-misses # 0.03% of all branches ( +- 1.62% ) (50.07%)
# Table of individual measurements:
5.97105 (+0.01991) #
5.93649 (-0.01466) #
5.93568 (-0.01547) #
5.95940 (+0.00825) #
5.95310 (+0.00196) #
# Final result:
5.95114 +- 0.00679 seconds time elapsed ( +- 0.11% )
```
Now, use the borrowing support (this diff):
```
Performance counter stats for '/tmp/cpp_benchmark.MakeAddBorrow' (5 runs):
5,528.58 msec task-clock # 1.000 CPUs utilized ( +- 0.33% )
451 context-switches # 0.082 K/sec ( +- 4.29% )
6 cpu-migrations # 0.001 K/sec ( +- 34.65% )
13,155 page-faults # 0.002 M/sec ( +- 0.32% )
10,985,806,260 cycles # 1.987 GHz ( +- 0.33% ) (50.09%)
30,657,224,792 instructions # 2.79 insn per cycle ( +- 0.02% ) (50.07%)
6,247,997,282 branches # 1130.127 M/sec ( +- 0.01% ) (50.04%)
1,732,507 branch-misses # 0.03% of all branches ( +- 1.04% ) (50.06%)
# Table of individual measurements:
5.5626 (+0.0356) #
5.4913 (-0.0357) #
5.5007 (-0.0263) #
5.5839 (+0.0569) #
5.4965 (-0.0305) #
# Final result:
5.5270 +- 0.0192 seconds time elapsed ( +- 0.35% )
```
7.01% cycles improvement vs previous diff
2.78% instructions improvement vs previous diff
5.28% cycles improvement vs baseline
0.1% instructions improvement vs baseline
Note that instructions per cycle improved (2.79 for this diff vs. 2.65 for the baseline). This makes sense because we are avoiding memory accesses, and each memory access shows up as a single instruction that takes 3 cycles (or many more on a cache miss). This is also a great example of an effect that instruction counting is blind to.
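For anyone who wants to re-derive the summary figures, the sketch below recomputes the relative improvements and instructions per cycle from the cycle and instruction counters in the three `perf stat` runs above. The struct and function names are made up for this example; only the counter values come from the output.

```cpp
#include <cassert>
#include <cstdint>

// Cycle and instruction counters copied from the perf stat output above.
struct Counters {
  std::uint64_t cycles;
  std::uint64_t instructions;
};

constexpr Counters kBaseline{11597542455ULL, 30687118071ULL};  // MaybeOwnedBaselineD27607270
constexpr Counters kPrevious{11814216945ULL, 31535372676ULL};  // MeasureMaybeOwnedCost
constexpr Counters kThisDiff{10985806260ULL, 30657224792ULL};  // MakeAddBorrow

// Relative improvement of `after` over `before`, in percent
// (positive means `after` is smaller, i.e. better).
double pct_improvement(std::uint64_t before, std::uint64_t after) {
  assert(before > 0);
  return 100.0 *
         (static_cast<double>(before) - static_cast<double>(after)) /
         static_cast<double>(before);
}

// Instructions retired per cycle for one run.
double ipc(const Counters& c) {
  assert(c.cycles > 0);
  return static_cast<double>(c.instructions) / static_cast<double>(c.cycles);
}
```

For example, `pct_improvement(kBaseline.cycles, kThisDiff.cycles)` gives the cycles improvement vs. baseline, and `ipc(kThisDiff)` gives the instructions-per-cycle figure for this diff.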
Reviewed By: bhosmer
Differential Revision: D27607295
fbshipit-source-id: 7a0205b4aba6b63febbb5966f0f5e2627815cbbe