Decompose torch.ops.higher_order.auto_functionalized in Inductor (#118673)
We'd like to get auto_functionalized to work with AOTInductor. To get
there, we decompose `output = auto_functionalized(inplace_op, ...)` into its
corresponding aten ops (clones + inplace_op) before the Inductor lowering phase.
This decomposition must happen at the end of the Inductor FX passes
because it introduces in-place operations.
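To illustrate the semantics, here is a minimal pure-Python sketch (plain lists stand in for tensors; the names are illustrative, not Inductor's actual API): the functional wrapper clones its mutated input before running the in-place op, and the decomposition makes that clone + in-place call explicit in the graph.

```python
import copy

def my_inplace_op_(buf):
    # Toy in-place op: doubles every element of `buf` in place.
    for i in range(len(buf)):
        buf[i] *= 2

def auto_functionalized(inplace_op, buf):
    # Functional wrapper: never mutates its input.
    # Returns a single output that is a List of the updated buffers.
    new_buf = copy.deepcopy(buf)   # the clone the decomposition makes explicit
    inplace_op(new_buf)
    return [new_buf]

def decomposed(buf):
    # After decomposition, the graph contains the clone + in-place op directly.
    new_buf = copy.deepcopy(buf)
    my_inplace_op_(new_buf)
    return new_buf

x = [1, 2, 3]
(out_wrapped,) = auto_functionalized(my_inplace_op_, x)
out_decomposed = decomposed(x)
assert x == [1, 2, 3]                          # input is never mutated
assert out_wrapped == out_decomposed == [2, 4, 6]
```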
The pattern matcher's "replace this single node with multiple nodes" API
isn't robust enough here. The problem is that `auto_functionalized`
returns a single output (a List), but the decomposition ends up
returning the List's unpacked elements (e.g. it may return two tensors).
Previously, `replace_with_graph` asserted this could not happen; I fixed
it up to handle this case.
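The shape mismatch can be sketched without any FX machinery (a toy model, not the real pattern-matcher API): users of the old node read the List via getitem, so after replacement each `getitem(output, i)` must be rewired to the i-th unpacked output of the new graph.

```python
def original_node():
    # Single output that is a List.
    return [10, 20]

def replacement_graph():
    # The decomposition returns the unpacked elements instead.
    return 10, 20

# Users of the old node access its results via getitem on the List.
old = original_node()
users = [old[0] + 1, old[1] + 1]

# After replacement, getitem(old, i) must map to new_outputs[i].
new_outputs = replacement_graph()
rewired = [new_outputs[0] + 1, new_outputs[1] + 1]
assert users == rewired == [11, 21]
```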
Future: Not all of the clones are necessary (e.g. if the input's last
usage is this operator, then we don't need to clone it). We can add this
logic later.
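The future optimization amounts to a last-use check before emitting each clone; a hypothetical sketch (the function and flag names are made up for illustration):

```python
def decompose(node, input_has_later_users):
    # Only clone when the mutated input is still needed after this operator;
    # if this op is the input's last use, mutating it directly is safe.
    steps = []
    if input_has_later_users:
        steps.append("clone")
    steps.append("inplace_op")
    return steps

assert decompose("n", input_has_later_users=True) == ["clone", "inplace_op"]
assert decompose("n", input_has_later_users=False) == ["inplace_op"]
```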
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118673
Approved by: https://github.com/oulgen