[Static Runtime] Determine function for `ProcessedNode::run()` statically (#66692)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66692
Currently `ProcessedNode::run()` performs 2 dynamic dispatches to decide which function implementation to execute, depending on whether the function is an out variant, a native function, or an interpreter fallback. This decision is repeated every time Static Runtime executes an operation.
This change makes the same decision once, at module loading time, which removes 1 dynamic dispatch from every `ProcessedNode::run()` call at runtime.
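The idea can be illustrated with a minimal sketch. This is not the actual Static Runtime code: `Node`, `NodeFn`, `makeNode`, and the two boolean flags are hypothetical stand-ins, and the lambdas only record which path was chosen. The implementation kind is resolved once at construction ("load time"), so `run()` is left with a single indirect call through `std::function`.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical stand-ins for illustration; not the real Static Runtime types.
enum class FunctionKind { kOutVariant, kNativeFunction, kInterpreterFallback };

struct Node;
using NodeFn = std::function<void(Node*)>;

struct Node {
  Node(NodeFn fn, FunctionKind kind) : fn_(std::move(fn)), kind_(kind) {
    assert(fn_ && "the implementation must be chosen before the node runs");
  }

  // Hot path: no branching on the kind, only one call through std::function.
  void run() { fn_(this); }

  FunctionKind kind() const { return kind_; }

  int result = 0;  // placeholder for the node's outputs

 private:
  NodeFn fn_;          // implementation picked at construction time
  FunctionKind kind_;  // kept only so callers can query what was picked
};

// "Load time": decide once which implementation this node will use.
// The flags are assumptions standing in for whatever registry lookup
// the real loader performs.
Node makeNode(bool has_out_variant, bool has_native_function) {
  if (has_out_variant) {
    return Node([](Node* n) { n->result = 1; }, FunctionKind::kOutVariant);
  }
  if (has_native_function) {
    return Node([](Node* n) { n->result = 2; }, FunctionKind::kNativeFunction);
  }
  return Node([](Node* n) { n->result = 3; }, FunctionKind::kInterpreterFallback);
}
```

In this sketch the branch on the function kind lives in `makeNode`, which runs once per node, while `run()` can be called any number of times without revisiting that decision.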
**size reduction**
Saves 4 bytes per `ProcessedNode`.
- Before: `sizeof(c10::variant<OutVariant, NativeFunction, Operation>)`: 40
- After: `sizeof(std::function<void(ProcessedNode*)>)`: 32 + `sizeof(FunctionKind)`: 4 = 36
**latency optimization**
Expected to remove 2 memory loads & 1 conditional jump per `ProcessedNode::run()` execution (needs to be confirmed from compiled binary code).
Ran `ptvsc2_predictor_bench` with `inline_cvr` for 1000 iterations:
- local : 7.56026 -> 7.24794
- local_ro: 1.5799 -> 1.55504
- remote_ro: 10.6464 -> 10.3017
Test Plan: Ran existing unit tests
Reviewed By: swolchok
Differential Revision: D31591785
fbshipit-source-id: 5de83ca386af509381e08ecedf071ee4e9f0f0b0