[static runtime] fuse gather+to+lengths_to_offsets (#64075)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64075
Test Plan:
Before:
`I0826 17:17:54.165174 1064079 PyTorchPredictorBenchLib.cpp:313] PyTorch run finished. Milliseconds per iter: 6.66724. Iters per second: 149.987`
After:
`I0826 17:13:07.464485 1040300 PyTorchPredictorBenchLib.cpp:313] PyTorch run finished. Milliseconds per iter: 6.46362. Iters per second: 154.712`
Profile after: P453143683
Accuracy tested by comparing against the JIT interpreter: all differences are under 1e-3 (NNC ops turned on) https://www.internalfb.com/intern/diff/view-version/136824794/
======
With 100-request recordio inputs (211 inputs)
Before:
`I1101 12:43:13.558375 742187 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 11.7882. Iters per second: 84.8309`
After:
`I1101 13:50:41.087644 1126186 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 11.6763. Iters per second: 85.6438`
Profile after: P465977010
Constituent ops before (total is 0.5646):
```
0.187392 ms. 1.61737%. fb::clip_ranges_gather (309 nodes, out variant)
0.174101 ms. 1.50266%. fb::lengths_to_offsets (464 nodes, out variant)
0.203126 ms. 1.75317%. static_runtime::to_copy (805 nodes, out variant)
```
Constituent ops after (total is 0.4985):
```
0.376559 ms. 3.25614%. fb::clip_ranges_to_gather_to_offsets (305 nodes, out variant)
0.0614349 ms. 0.531235%. fb::lengths_to_offsets (159 nodes, out variant)
0.0573315 ms. 0.495751%. static_runtime::to_copy (195 nodes, out variant)
0.00325543 ms. 0.0281501%. fb::gather_ranges (4 nodes, out variant)
```
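The totals quoted above can be re-derived from the per-op lines. A minimal sketch, assuming the "total" is simply the sum of the listed per-op times (the variable names are illustrative and not part of the benchmark tooling):

```python
# Per-op times in ms, copied from the "before" and "after" profiles above.
before = {
    "fb::clip_ranges_gather": 0.187392,
    "fb::lengths_to_offsets": 0.174101,
    "static_runtime::to_copy": 0.203126,
}
after = {
    "fb::clip_ranges_to_gather_to_offsets": 0.376559,
    "fb::lengths_to_offsets": 0.0614349,
    "static_runtime::to_copy": 0.0573315,
    "fb::gather_ranges": 0.00325543,
}

# Assumption: the quoted totals are plain sums over the listed ops.
before_total = sum(before.values())
after_total = sum(after.values())

saved_ms = before_total - after_total
saved_pct = 100.0 * saved_ms / before_total

print(f"before: {before_total:.4f} ms, after: {after_total:.4f} ms")
print(f"saved:  {saved_ms:.4f} ms ({saved_pct:.1f}% of the constituent-op time)")
```

The sums agree with the quoted totals up to rounding in the last digit. The saving is roughly 0.07 ms per iteration across these ops.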
Compare with jit interpreter inside benchmark:
`I1101 13:55:53.013602 1149446 PtVsBlackBoxPredictorBenchLib.cpp:175] Finished comparing PT static runtime and jit interpreter results`
======
Casting on the fly:
a. Static runtime off
```
Static runtime ms per iter: 11.4658. Iters per second: 87.2159
0.220367 ms. 1.94726%. static_runtime::to_copy (805 nodes, out variant)
0.172585 ms. 1.52504%. fb::clip_ranges_gather (309 nodes, out variant)
0.157836 ms. 1.39471%. fb::lengths_to_offsets (464 nodes, out variant)
```
b. Casting on the fly, using explicit allocation+to_copy (which has the fast path for certain cases, but we'll always call empty):
```
I1115 09:08:35.711972 1925508 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 11.6732. Iters per second: 85.6662
0.599439 ms. 5.25098%. fb::clip_ranges_to_gather_to_offsets (305 nodes, out variant)
0.0552475 ms. 0.483958%. fb::lengths_to_offsets (159 nodes, out variant)
0.0576032 ms. 0.504593%. static_runtime::to_copy (195 nodes, out variant)
0.00299026 ms. 0.0261941%. fb::gather_ranges (4 nodes, out variant)
```
c. Casting on the fly with native::to (no explicit allocation, but no fast path):
```
Static runtime ms per iter: 11.5627. Iters per second: 86.4849
0.454356 ms. 3.9652%. fb::clip_ranges_to_gather_to_offsets (305 nodes, out variant)
0.06315 ms. 0.551115%. static_runtime::to_copy (195 nodes, out variant)
0.0590741 ms. 0.515544%. fb::lengths_to_offsets (159 nodes, out variant)
0.00359182 ms. 0.031346%. fb::clip_ranges_gather (4 nodes, out variant)
```
d. Removal of the to() call in question from the fusion pattern:
```
Static runtime ms per iter: 11.3658. Iters per second: 87.9836
0.29591 ms. 2.6479%. fb::clip_ranges_to_gather_to_offsets (305 nodes, out variant)
0.154612 ms. 1.38352%. static_runtime::to_copy (500 nodes, out variant)
0.0567151 ms. 0.507505%. fb::lengths_to_offsets (159 nodes, out variant)
0.0051115 ms. 0.0457394%. fb::clip_ranges_gather (4 nodes, out variant)
```
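To compare variants a through d on the affected ops alone, the per-op times can be summed per variant. This is a rough sketch over the numbers listed above, not an additional measurement. The variant labels follow the list above, and it assumes the listed ops are the only ones that differ between variants:

```python
# Per-op times in ms for each variant, copied from the profiles above.
variants = {
    "a": [0.220367, 0.172585, 0.157836],
    "b": [0.599439, 0.0552475, 0.0576032, 0.00299026],
    "c": [0.454356, 0.06315, 0.0590741, 0.00359182],
    "d": [0.29591, 0.154612, 0.0567151, 0.0051115],
}

# Sum the listed ops per variant and find the cheapest one.
totals = {name: sum(times) for name, times in variants.items()}
best = min(totals, key=totals.get)

for name in sorted(totals, key=totals.get):
    print(f"{name}: {totals[name]:.4f} ms")
print(f"lowest constituent-op total: variant {best}")
```

On these sums, variant d has the lowest constituent-op time and variant b the highest, which matches the ordering of the end-to-end ms-per-iter figures.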
Reviewed By: hlu1
Differential Revision: D30515441
fbshipit-source-id: 53acee10619ac2be7dc8982e929e3210c4bb6d21