[PyTorch] Support borrowing in/out Tensors in TensorIterator (#55690)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55690
Change `OperandInfo::tensor` and
`TensorIteratorConfig::tensors` to hold `c10::MaybeOwned<Tensor>`
instead of `Tensor`, and update call sites for the resulting
pointer-like syntax. `OperandInfo` had to be marked `C10_ALWAYS_INLINE`
to preserve the existing inlining behavior, for whatever
compiler-idiosyncratic reason.
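For context, here is a minimal sketch (hypothetical names, not the actual `OperandInfo` code) of what holding a `c10::MaybeOwned<Tensor>` looks like and why call sites pick up pointer-like syntax:
```
// Hypothetical, simplified sketch -- not the real OperandInfo. It only
// illustrates c10::MaybeOwned<Tensor> storage and the resulting
// pointer-like access (`->` / `*`).
#include <utility>

#include <ATen/ATen.h>
#include <c10/util/MaybeOwned.h>

struct OperandInfoSketch {
  // Previously a plain `Tensor`; now either borrows or owns one.
  c10::MaybeOwned<at::Tensor> tensor;

  // Borrowing skips the refcount bump, but the caller must keep `t`
  // alive for the lifetime of this struct.
  explicit OperandInfoSketch(const at::Tensor& t)
      : tensor(c10::MaybeOwned<at::Tensor>::borrowed(t)) {}

  // Owning takes its own reference, matching the old behavior.
  explicit OperandInfoSketch(at::Tensor&& t)
      : tensor(c10::MaybeOwned<at::Tensor>::owned(std::move(t))) {}

  bool is_defined() const {
    // The "consequent pointer syntax": tensor->defined(), (*tensor).sizes(), ...
    return tensor->defined();
  }
};
```
The benchmark below measures the cost of carrying this representation while all call sites still own their tensors.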
This is a separate diff from the one that actually uses borrowing, both
so that the cost of merely supporting it can be measured and because
there is no reason not to send it separately. We probably should not
land this without a plan to migrate a large number of TensorIterator
use cases to borrowing, either directly or via structured kernels plus
borrowing.
ghstack-source-id: 127092681
Test Plan:
Existing CI for correctness.
Ran `perf stat` on the existing in-place add C++ benchmark and compared against D27607270 (the diff before last; the previous diff is arguably part of supporting borrowing). Measurements were taken on a devbig machine with turbo off.
Baseline:
```
Performance counter stats for '/tmp/cpp_benchmark.MaybeOwnedBaselineD27607270' (5 runs):
5,837.13 msec task-clock # 1.000 CPUs utilized ( +- 0.34% )
442 context-switches # 0.076 K/sec ( +- 3.54% )
5 cpu-migrations # 0.001 K/sec ( +- 19.07% )
13,144 page-faults # 0.002 M/sec ( +- 0.39% )
11,597,542,455 cycles # 1.987 GHz ( +- 0.32% ) (50.05%)
30,687,118,071 instructions # 2.65 insn per cycle ( +- 0.03% ) (50.08%)
6,247,677,215 branches # 1070.334 M/sec ( +- 0.04% ) (50.08%)
1,705,403 branch-misses # 0.03% of all branches ( +- 2.16% ) (50.05%)
# Table of individual measurements:
5.9025 (+0.0663) #
5.8276 (-0.0085) #
5.8151 (-0.0210) #
5.7842 (-0.0519) #
5.8511 (+0.0150) #
# Final result:
5.8361 +- 0.0198 seconds time elapsed ( +- 0.34% )
```
With borrowing support added but not used:
```
Performance counter stats for '/tmp/cpp_benchmark.MeasureMaybeOwnedCost' (5 runs):
5,947.20 msec task-clock # 0.999 CPUs utilized ( +- 0.15% )
422 context-switches # 0.071 K/sec ( +- 1.88% )
3 cpu-migrations # 0.001 K/sec ( +- 47.14% )
13,025 page-faults # 0.002 M/sec ( +- 0.46% )
11,814,216,945 cycles # 1.987 GHz ( +- 0.12% ) (50.08%)
31,535,372,676 instructions # 2.67 insn per cycle ( +- 0.06% ) (50.09%)
6,482,809,438 branches # 1090.060 M/sec ( +- 0.04% ) (50.07%)
1,688,623 branch-misses # 0.03% of all branches ( +- 1.62% ) (50.07%)
# Table of individual measurements:
5.97105 (+0.01991) #
5.93649 (-0.01466) #
5.93568 (-0.01547) #
5.95940 (+0.00825) #
5.95310 (+0.00196) #
# Final result:
5.95114 +- 0.00679 seconds time elapsed ( +- 0.11% )
```
1.87% cycles regression vs baseline
2.76% instructions regression vs baseline
Reviewed By: ezyang
Differential Revision: D27607293
fbshipit-source-id: 55b9873c15b0de689ae17f9c35eb4ba0d026cade