[PyTorch][Static Runtime] Switch input/output repr to 2-byte offsets (#67934)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67934
This reduces the memory requirements of ProcessedNode. Outputs are allocated sequentially into a shared array that supports at most 2**16 - 1 values (current models seem to have 10-20x fewer than that), so ProcessedNode only needs to store a 2-byte offset into that array and a 2-byte output count.
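As a rough illustration of the layout described above (not the actual Static Runtime code), the sketch below shows a node that keeps only a 2-byte offset and a 2-byte count into a shared, sequentially allocated array. The names `Value`, `CompactNode`, and `SharedValueArray` are hypothetical and chosen only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the runtime's value type (assumption for this sketch).
using Value = int;

// Hypothetical compact node: two 2-byte fields instead of an owned container.
struct CompactNode {
  std::uint16_t outputs_offset; // index of the node's first output in the shared array
  std::uint16_t num_outputs;    // number of consecutive outputs owned by the node
};

// Hypothetical shared storage that hands out consecutive output slots.
class SharedValueArray {
 public:
  // Reserve n consecutive slots and return the compact handle for them.
  CompactNode allocate(std::size_t n) {
    constexpr std::size_t kMax = std::numeric_limits<std::uint16_t>::max();
    // Reject allocations that would push the total past 2**16 - 1 values.
    if (n > kMax || values_.size() > kMax - n) {
      throw std::length_error("shared array is limited to 2**16 - 1 values");
    }
    CompactNode node{static_cast<std::uint16_t>(values_.size()),
                     static_cast<std::uint16_t>(n)};
    values_.resize(values_.size() + n);
    return node;
  }

  // Resolve the i-th output of a node through its offset.
  Value& output(const CompactNode& node, std::size_t i) {
    assert(i < node.num_outputs);
    return values_[node.outputs_offset + i];
  }

  std::size_t size() const { return values_.size(); }

 private:
  std::vector<Value> values_;
};
```

In this sketch, allocating 2 outputs and then 3 outputs yields offsets 0 and 2, and each node handle occupies 4 bytes on typical platforms regardless of how many outputs it has.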
ghstack-source-id: 143429113
Test Plan:
Patched d1jang's diff to measure memory turnover around SR startup.
Previous diff, CMF local:
```
I1104 12:19:39.900211 597593 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 427120
```
This diff, CMF local:
```
I1105 12:17:36.459688 866763 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 354208
```
Savings: 427120 - 354208 = 72912 bytes (17%).
Perf looks neutral; see next diff (D32216573) test plan for details.
Reviewed By: hlu1
Differential Revision: D32190751
fbshipit-source-id: 30c1e2caa9460f0d83b2d9bb24c68ccfcef757cc