[PT2 Inference] Prototype of Inference Runtime (#108482)
Summary:
This diff demonstrates a simplified E2E workflow for PT2 Inference stack:
1. Model author with `torch.export()`
2. Model processing with `aot_inductor.compile()`
3. Model served with a new Inference Runtime API, named `ModelRunner`
`torch.export()` and `aot_inductor.compile()` produces a zip file using `PyTorchStreamWriter`.
Runtime reads the zip file with `PyTorchStreamReader`.
The zip file contains
{F1080328179}
More discussion on packaging can be found in https://docs.google.com/document/d/1C-4DP5yu7ZhX1aB1p9JcVZ5TultDKObM10AqEtmZ-nU/edit?usp=sharing
Runtime can now switch between two Execution modes:
1. Graph Interpreter mode, implemented based on Sigmoid's Executor
2. AOTInductor mode, implemented based on FBAOTInductorModel
Test Plan:
buck2 run mode/dev-nosan mode/inplace -c fbcode.enable_gpu_sections=True //sigmoid/inference/test:e2e_test
Export and Lower with AOTInductor
buck2 run mode/dev-sand mode/inplace -c fbcode.enable_gpu_sections=True sigmoid/inference:export_package
Run with GraphInterpreter and AOTInducotr
buck2 run mode/dev-nosan //sigmoid/inference:main
Reviewed By: suo
Differential Revision: D47781098
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108482
Approved by: https://github.com/zhxchen17