pytorch
6032ea03 - [PyTorch] Migrate add operators to borrow in TensorIteratorBase (#55691)

3 years ago

[PyTorch] Migrate add operators to borrow in TensorIteratorBase (#55691)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55691

Avoiding reference counting for these operations is roughly a 5% CPU time win vs not supporting borrowing at all.

ghstack-source-id: 127092680

Test Plan:
Existing CI for correctness.

Continued perf stat experiment from previous diff. All results included below for reviewing convenience.

Baseline:
```
 Performance counter stats for '/tmp/cpp_benchmark.MaybeOwnedBaselineD27607270' (5 runs):

          5,837.13 msec task-clock        #    1.000 CPUs utilized      ( +-  0.34% )
               442      context-switches  #    0.076 K/sec              ( +-  3.54% )
                 5      cpu-migrations    #    0.001 K/sec              ( +- 19.07% )
            13,144      page-faults       #    0.002 M/sec              ( +-  0.39% )
    11,597,542,455      cycles            #    1.987 GHz                ( +-  0.32% )  (50.05%)
    30,687,118,071      instructions      #    2.65  insn per cycle     ( +-  0.03% )  (50.08%)
     6,247,677,215      branches          # 1070.334 M/sec              ( +-  0.04% )  (50.08%)
         1,705,403      branch-misses     #    0.03% of all branches    ( +-  2.16% )  (50.05%)

# Table of individual measurements:
5.9025 (+0.0663) #
5.8276 (-0.0085) #
5.8151 (-0.0210) #
5.7842 (-0.0519) #
5.8511 (+0.0150) #

# Final result:
5.8361 +- 0.0198 seconds time elapsed  ( +-  0.34% )
```

Add but don't use borrowing support:
```
 Performance counter stats for '/tmp/cpp_benchmark.MeasureMaybeOwnedCost' (5 runs):

          5,947.20 msec task-clock        #    0.999 CPUs utilized      ( +-  0.15% )
               422      context-switches  #    0.071 K/sec              ( +-  1.88% )
                 3      cpu-migrations    #    0.001 K/sec              ( +- 47.14% )
            13,025      page-faults       #    0.002 M/sec              ( +-  0.46% )
    11,814,216,945      cycles            #    1.987 GHz                ( +-  0.12% )  (50.08%)
    31,535,372,676      instructions      #    2.67  insn per cycle     ( +-  0.06% )  (50.09%)
     6,482,809,438      branches          # 1090.060 M/sec              ( +-  0.04% )  (50.07%)
         1,688,623      branch-misses     #    0.03% of all branches    ( +-  1.62% )  (50.07%)

# Table of individual measurements:
5.97105 (+0.01991) #
5.93649 (-0.01466) #
5.93568 (-0.01547) #
5.95940 (+0.00825) #
5.95310 (+0.00196) #

# Final result:
5.95114 +- 0.00679 seconds time elapsed  ( +-  0.11% )
```

Now, use the borrowing support (this diff):
```
 Performance counter stats for '/tmp/cpp_benchmark.MakeAddBorrow' (5 runs):

          5,528.58 msec task-clock        #    1.000 CPUs utilized      ( +-  0.33% )
               451      context-switches  #    0.082 K/sec              ( +-  4.29% )
                 6      cpu-migrations    #    0.001 K/sec              ( +- 34.65% )
            13,155      page-faults       #    0.002 M/sec              ( +-  0.32% )
    10,985,806,260      cycles            #    1.987 GHz                ( +-  0.33% )  (50.09%)
    30,657,224,792      instructions      #    2.79  insn per cycle     ( +-  0.02% )  (50.07%)
     6,247,997,282      branches          # 1130.127 M/sec              ( +-  0.01% )  (50.04%)
         1,732,507      branch-misses     #    0.03% of all branches    ( +-  1.04% )  (50.06%)

# Table of individual measurements:
5.5626 (+0.0356) #
5.4913 (-0.0357) #
5.5007 (-0.0263) #
5.5839 (+0.0569) #
5.4965 (-0.0305) #

# Final result:
5.5270 +- 0.0192 seconds time elapsed  ( +-  0.35% )
```

7.02% cycles improvement vs previous diff
2.78% instructions improvement vs previous diff
5.28% cycles improvement vs baseline
0.1% instructions improvement vs baseline

Note that instructions per cycle improved. This makes sense because we are avoiding memory accesses, and memory accesses manifest as instructions which take 3 (or many more in the case of a cache miss) cycles. This is also a great example of an effect that instruction counting is blind to.

Reviewed By: bhosmer

Differential Revision: D27607295

fbshipit-source-id: 7a0205b4aba6b63febbb5966f0f5e2627815cbbe
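To make the "avoiding reference counting" point concrete, here is a minimal sketch of the difference between owning and borrowing a reference-counted handle. It is not the PyTorch implementation: `FakeTensor`, `TensorHandle`, `MaybeOwnedHandle`, and the two helper functions are hypothetical names invented for illustration, and `std::shared_ptr` stands in for whatever reference-counted pointer the real tensor type uses. The benchmark names above suggest a MaybeOwned-style wrapper, which is the only reason the sketch is shaped this way.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for a reference-counted tensor payload.
struct FakeTensor {
  int value;
};
using TensorHandle = std::shared_ptr<FakeTensor>;

// Hypothetical, simplified owned-or-borrowed holder (illustration only).
// Owned: stores its own handle, which costs a refcount increment/decrement.
// Borrowed: stores a plain pointer to the caller's handle, so the refcount
// is never touched. The caller must keep that handle alive for the
// lifetime of the borrow.
class MaybeOwnedHandle {
 public:
  static MaybeOwnedHandle owned(TensorHandle h) {
    return MaybeOwnedHandle(std::move(h), nullptr);
  }
  static MaybeOwnedHandle borrowed(const TensorHandle& h) {
    return MaybeOwnedHandle(nullptr, &h);
  }
  const FakeTensor& operator*() const {
    return borrowed_ ? **borrowed_ : *owned_;
  }
  bool is_borrowed() const { return borrowed_ != nullptr; }

 private:
  MaybeOwnedHandle(TensorHandle o, const TensorHandle* b)
      : owned_(std::move(o)), borrowed_(b) {}
  TensorHandle owned_;            // empty when borrowing
  const TensorHandle* borrowed_;  // null when owning
};

// Reports the refcount observed while an owned holder is alive.
long use_count_while_owned(const TensorHandle& h) {
  MaybeOwnedHandle holder = MaybeOwnedHandle::owned(h);  // copies the handle
  assert(!holder.is_borrowed());
  return h.use_count();
}

// Reports the refcount observed while a borrowed holder is alive.
long use_count_while_borrowed(const TensorHandle& h) {
  MaybeOwnedHandle holder = MaybeOwnedHandle::borrowed(h);  // no copy
  assert(holder.is_borrowed());
  return h.use_count();
}
```

In this sketch, holding an owned copy bumps the count for as long as the holder lives, while borrowing leaves it unchanged. In a real tensor library the count is typically updated atomically, so skipping it removes memory traffic on a hot path, which would be consistent with the cycle savings reported above showing up without a matching drop in instruction count.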

Author

swolchok

Committer

facebook-github-bot

Parents

01842d2b
