pytorch
01842d2b - [PyTorch] Support borrowing in/out Tensors in TensorIterator (#55690)

Commit
3 years ago
[PyTorch] Support borrowing in/out Tensors in TensorIterator (#55690)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55690

Just change `OperandInfo::tensor` and `TensorIteratorConfig::tensors` to hold `c10::MaybeOwned<Tensor>`, and deal with the consequent pointer syntax. Had to C10_ALWAYS_INLINE OperandInfo to preserve existing inlining behavior for whatever compiler-idiosyncratic reason.

This is a separate diff from usage to enable measuring the cost of support, and because there is no reason not to send it separately. We probably should not land this without a plan to migrate a lot of TensorIterator use cases to use either borrowing or structured kernels & borrowing.

ghstack-source-id: 127092681

Test Plan:
Existing CI for correctness.

Ran perf stat on existing add in-place C++ benchmark and compared to D27607270 (diff before last; previous diff is arguably part of supporting borrowing). This is a devbig with turbo off.

Baseline:
```
 Performance counter stats for '/tmp/cpp_benchmark.MaybeOwnedBaselineD27607270' (5 runs):

          5,837.13 msec task-clock          #    1.000 CPUs utilized      ( +-  0.34% )
               442      context-switches    #    0.076 K/sec              ( +-  3.54% )
                 5      cpu-migrations      #    0.001 K/sec              ( +- 19.07% )
            13,144      page-faults         #    0.002 M/sec              ( +-  0.39% )
    11,597,542,455      cycles              #    1.987 GHz                ( +-  0.32% )  (50.05%)
    30,687,118,071      instructions        #    2.65  insn per cycle     ( +-  0.03% )  (50.08%)
     6,247,677,215      branches            # 1070.334 M/sec              ( +-  0.04% )  (50.08%)
         1,705,403      branch-misses       #    0.03% of all branches    ( +-  2.16% )  (50.05%)

# Table of individual measurements:
5.9025 (+0.0663) #
5.8276 (-0.0085) #
5.8151 (-0.0210) #
5.7842 (-0.0519) #
5.8511 (+0.0150) #

# Final result:
5.8361 +- 0.0198 seconds time elapsed  ( +-  0.34% )
```

Add but don't use borrowing support:
```
 Performance counter stats for '/tmp/cpp_benchmark.MeasureMaybeOwnedCost' (5 runs):

          5,947.20 msec task-clock          #    0.999 CPUs utilized      ( +-  0.15% )
               422      context-switches    #    0.071 K/sec              ( +-  1.88% )
                 3      cpu-migrations      #    0.001 K/sec              ( +- 47.14% )
            13,025      page-faults         #    0.002 M/sec              ( +-  0.46% )
    11,814,216,945      cycles              #    1.987 GHz                ( +-  0.12% )  (50.08%)
    31,535,372,676      instructions        #    2.67  insn per cycle     ( +-  0.06% )  (50.09%)
     6,482,809,438      branches            # 1090.060 M/sec              ( +-  0.04% )  (50.07%)
         1,688,623      branch-misses       #    0.03% of all branches    ( +-  1.62% )  (50.07%)

# Table of individual measurements:
5.97105 (+0.01991) #
5.93649 (-0.01466) #
5.93568 (-0.01547) #
5.95940 (+0.00825) #
5.95310 (+0.00196) #

# Final result:
5.95114 +- 0.00679 seconds time elapsed  ( +-  0.11% )
```

1.87% cycles regression vs baseline
2.76% instructions regression vs baseline

Reviewed By: ezyang

Differential Revision: D27607293

fbshipit-source-id: 55b9873c15b0de689ae17f9c35eb4ba0d026cade
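
For readers unfamiliar with the borrowed-or-owned idea behind `c10::MaybeOwned<Tensor>`, here is a minimal sketch of the concept. It is not the c10 implementation: `FakeTensor` and `MaybeOwnedSketch` are hypothetical names invented for illustration, `FakeTensor` merely imitates a reference-counted handle, and C++17 is assumed for `std::optional`. The point it illustrates is that an owning operand copies the handle and bumps the reference count, while a borrowing operand only stores a pointer and requires the caller to keep the original alive.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <utility>

// Hypothetical stand-in for a refcounted tensor handle (NOT at::Tensor).
// Copying the handle shares the payload and increments its use count.
struct FakeTensor {
  std::shared_ptr<int> impl = std::make_shared<int>(0);
  long use_count() const { return impl.use_count(); }
};

// Simplified illustration of a borrowed-or-owned holder (NOT c10::MaybeOwned).
template <typename T>
class MaybeOwnedSketch {
 public:
  // Borrow: store only a pointer; the caller must outlive this object.
  static MaybeOwnedSketch borrowed(const T& t) { return MaybeOwnedSketch(&t); }
  // Own: take a copy (or moved value) and keep it alive ourselves.
  static MaybeOwnedSketch owned(T t) { return MaybeOwnedSketch(std::move(t)); }

  bool is_borrowed() const { return borrowed_ != nullptr; }

  // Pointer-like access, which is the "pointer syntax" callers must adopt.
  const T& operator*() const { return borrowed_ ? *borrowed_ : *owned_; }
  const T* operator->() const { return &**this; }

 private:
  explicit MaybeOwnedSketch(const T* p) : borrowed_(p) {}
  explicit MaybeOwnedSketch(T&& t) : owned_(std::move(t)) {}

  const T* borrowed_ = nullptr;  // set only in the borrowed state
  std::optional<T> owned_;       // set only in the owned state
};

// Use count seen through an owning operand: the copy adds one reference.
long use_count_when_owned(const FakeTensor& t) {
  auto operand = MaybeOwnedSketch<FakeTensor>::owned(t);
  return operand->use_count();
}

// Use count seen through a borrowing operand: no extra reference is taken.
long use_count_when_borrowed(const FakeTensor& t) {
  auto operand = MaybeOwnedSketch<FakeTensor>::borrowed(t);
  return operand->use_count();
}
```

With a freshly constructed `FakeTensor t`, `use_count_when_borrowed(t)` returns 1 and `use_count_when_owned(t)` returns 2. In this sketch, the saved reference-count traffic is the benefit of borrowing, and the extra state check in `operator*` is the kind of overhead that a measurement like the one above is meant to quantify.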