pytorch
01b30922 - [static runtime] fuse gather+to+lengths_to_offsets (#64075)

Commit
4 years ago
[static runtime] fuse gather+to+lengths_to_offsets (#64075)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64075

Test Plan:

Before:
`I0826 17:17:54.165174 1064079 PyTorchPredictorBenchLib.cpp:313] PyTorch run finished. Milliseconds per iter: 6.66724. Iters per second: 149.987`

After:
`I0826 17:13:07.464485 1040300 PyTorchPredictorBenchLib.cpp:313] PyTorch run finished. Milliseconds per iter: 6.46362. Iters per second: 154.712`

Profile after: P453143683

Accuracy tested by comparing with the jit interpreter; no differences above 1e-3 (nnc ops turned on).

https://www.internalfb.com/intern/diff/view-version/136824794/

======

With 100-request recordio inputs (211 inputs)

Before:
`I1101 12:43:13.558375 742187 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 11.7882. Iters per second: 84.8309`

After:
`I1101 13:50:41.087644 1126186 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 11.6763. Iters per second: 85.6438`

Profile after: P465977010

Constituent ops before (total is 0.5646):
```
0.187392 ms. 1.61737%. fb::clip_ranges_gather (309 nodes, out variant)
0.174101 ms. 1.50266%. fb::lengths_to_offsets (464 nodes, out variant)
0.203126 ms. 1.75317%. static_runtime::to_copy (805 nodes, out variant)
```

Constituent ops after (total is 0.4985):
```
0.376559 ms. 3.25614%. fb::clip_ranges_to_gather_to_offsets (305 nodes, out variant)
0.0614349 ms. 0.531235%. fb::lengths_to_offsets (159 nodes, out variant)
0.0573315 ms. 0.495751%. static_runtime::to_copy (195 nodes, out variant)
0.00325543 ms. 0.0281501%. fb::gather_ranges (4 nodes, out variant)
```

Compare with jit interpreter inside benchmark:
`I1101 13:55:53.013602 1149446 PtVsBlackBoxPredictorBenchLib.cpp:175] Finished comparing PT static runtime and jit interpreter results`

======

Casting on the fly:

a. Static runtime off
```
Static runtime ms per iter: 11.4658. Iters per second: 87.2159
0.220367 ms. 1.94726%. static_runtime::to_copy (805 nodes, out variant)
0.172585 ms. 1.52504%. fb::clip_ranges_gather (309 nodes, out variant)
0.157836 ms. 1.39471%. fb::lengths_to_offsets (464 nodes, out variant)
```

b. Casting on the fly, using explicit allocation+to_copy (which has the fast path for certain cases, but we'll always call empty):
```
I1115 09:08:35.711972 1925508 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 11.6732. Iters per second: 85.6662
0.599439 ms. 5.25098%. fb::clip_ranges_to_gather_to_offsets (305 nodes, out variant)
0.0552475 ms. 0.483958%. fb::lengths_to_offsets (159 nodes, out variant)
0.0576032 ms. 0.504593%. static_runtime::to_copy (195 nodes, out variant)
0.00299026 ms. 0.0261941%. fb::gather_ranges (4 nodes, out variant)
```

c. Casting on the fly with native::to (no explicit allocation, but no fast path):
```
Static runtime ms per iter: 11.5627. Iters per second: 86.4849
0.454356 ms. 3.9652%. fb::clip_ranges_to_gather_to_offsets (305 nodes, out variant)
0.06315 ms. 0.551115%. static_runtime::to_copy (195 nodes, out variant)
0.0590741 ms. 0.515544%. fb::lengths_to_offsets (159 nodes, out variant)
0.00359182 ms. 0.031346%. fb::clip_ranges_gather (4 nodes, out variant)
```

d. Removal of the to() call in question from the fusion pattern:
```
Static runtime ms per iter: 11.3658. Iters per second: 87.9836
0.29591 ms. 2.6479%. fb::clip_ranges_to_gather_to_offsets (305 nodes, out variant)
0.154612 ms. 1.38352%. static_runtime::to_copy (500 nodes, out variant)
0.0567151 ms. 0.507505%. fb::lengths_to_offsets (159 nodes, out variant)
0.0051115 ms. 0.0457394%. fb::clip_ranges_gather (4 nodes, out variant)
```

Reviewed By: hlu1

Differential Revision: D30515441

fbshipit-source-id: 53acee10619ac2be7dc8982e929e3210c4bb6d21
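The constituent-op totals and the end-to-end improvement quoted in the test plan can be re-derived from the logged numbers. The following is a minimal sketch, not part of the commit: the helper names `total_ms` and `percent_faster` are hypothetical, and the only inputs are the per-op timings and milliseconds-per-iter figures copied from the message.

```python
# Per-op timings (ms) copied from the 100-request profile in the commit message.
before = {
    "fb::clip_ranges_gather": 0.187392,
    "fb::lengths_to_offsets": 0.174101,
    "static_runtime::to_copy": 0.203126,
}
after = {
    "fb::clip_ranges_to_gather_to_offsets": 0.376559,
    "fb::lengths_to_offsets": 0.0614349,
    "static_runtime::to_copy": 0.0573315,
    "fb::gather_ranges": 0.00325543,
}


def total_ms(ops):
    """Sum the per-op times (hypothetical helper, not from the commit)."""
    return sum(ops.values())


def percent_faster(before_ms, after_ms):
    """Relative reduction in time per iteration, in percent (hypothetical helper)."""
    return 100.0 * (before_ms - after_ms) / before_ms


before_total = total_ms(before)
after_total = total_ms(after)

print(f"constituent ops: {before_total:.4f} ms -> {after_total:.4f} ms")
print(f"saved in constituent ops: {before_total - after_total:.4f} ms")

# End-to-end milliseconds per iter, also copied from the logs above.
print(f"first benchmark: {percent_faster(6.66724, 6.46362):.2f}% less time per iter")
print(f"100-request benchmark: {percent_faster(11.7882, 11.6763):.2f}% less time per iter")
```

The summed "after" timings come to roughly 0.49858 ms, so the 0.4985 in the message appears to be truncated rather than rounded. The end-to-end gain works out to about 3% on the first benchmark and just under 1% on the 100-request inputs.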