Rewrite group_batch_fusion.find_independent_subset_greedy() to be iterative. (#118324)
Improves the performance of inductor when searching large graphs for potential fusions.
Also adds some direct unit tests of find_independent_subset_greedy() to ensure that the rewrite didn't break behavior.
Fixes #98467
Previously, find_independent_subset_greedy() was recursive, and the example from the issue caused it to overflow the stack. This change makes it iterative and also caches some of the computed dependencies (it can't cache all of them because the caller is allowed to change the graph during the iteration).
Fusion is still slow, but it at least finishes.
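For readers unfamiliar with the approach, here is a minimal sketch of an iterative greedy search for independent subsets with a dependency cache. It is not the code from this PR: the toy graph representation (a dict mapping each node name to its direct inputs) and the names `transitive_deps` and `independent_subsets` are hypothetical, and the sketch assumes the graph is not mutated during iteration, which is exactly the case where full caching is safe.

```python
from typing import Dict, Iterable, Iterator, List, Set

# Hypothetical toy graph: node name -> names of its direct inputs.
Graph = Dict[str, List[str]]


def transitive_deps(graph: Graph, node: str, cache: Dict[str, Set[str]]) -> Set[str]:
    """Collect every ancestor of `node` with an explicit stack (no recursion)."""
    if node in cache:
        return cache[node]
    seen: Set[str] = set()
    stack = list(graph.get(node, []))
    while stack:
        cur = stack.pop()
        if cur in seen:
            continue
        seen.add(cur)
        stack.extend(graph.get(cur, []))
    # Safe to cache only because this sketch assumes the graph never changes.
    cache[node] = seen
    return seen


def independent_subsets(graph: Graph, candidates: Iterable[str]) -> Iterator[List[str]]:
    """Greedily yield groups of candidates with no dependency between members."""
    cache: Dict[str, Set[str]] = {}
    remaining = list(candidates)
    while remaining:
        subset: List[str] = []
        subset_deps: Set[str] = set()
        deferred: List[str] = []
        for node in remaining:
            deps = transitive_deps(graph, node, cache)
            # Accept the node only if it neither depends on a chosen node
            # nor is depended upon by one.
            if deps.isdisjoint(subset) and node not in subset_deps:
                subset.append(node)
                subset_deps |= deps
            else:
                deferred.append(node)
        yield subset
        # The first remaining node is always accepted, so this loop terminates.
        remaining = deferred
```

Because both loops use explicit lists rather than the call stack, the depth of the graph no longer affects stack usage; a chain of tens of thousands of nodes is handled without hitting Python's recursion limit.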
After this change the example given in #98467 has the following backend timings (on one particular CPU):
eager timing: 3m:23s
aot_eager timing: 4m:12s
inductor timing: 22m:24s
Possible future work to improve this further:
1. In dynamo, limit the amount of inlining allowed before falling back to a graph break. This test ends up tracing through 483k bytecodes while generating the graph.
2. In inductor, add a limit so we don't exhaustively search the graph for fusion possibilities.
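As a rough illustration of the second item, a search limit could be as simple as a budget on the number of candidates examined. The sketch below is purely hypothetical: `search_with_budget`, `is_fusible`, and `max_checks` are made-up names, not inductor APIs or config options, and a real limit would need to be tuned against actual graphs.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def search_with_budget(
    candidates: Iterable[T],
    is_fusible: Callable[[T], bool],  # hypothetical predicate supplied by the caller
    max_checks: int,                  # hypothetical budget, not a real inductor setting
) -> List[T]:
    """Collect fusible candidates, giving up after `max_checks` checks."""
    found: List[T] = []
    for checks, cand in enumerate(candidates):
        if checks >= max_checks:
            break  # budget exhausted: stop searching instead of scanning everything
        if is_fusible(cand):
            found.append(cand)
    return found
```

The trade-off is that fusion opportunities beyond the budget are missed, in exchange for a bound on search time.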
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118324
Approved by: https://github.com/oulgen