[inductor] let codegen not rely on node order (#107320)
We'd like to benchmark fusion (either for autotuning or for gathering data to find some patterns that can guide optimizations). There is a deadlock here that prevents us from doing this: to benchmark fusion, we need do codegen before all the fusions are done. However currently codegen rely on xSchedulerNode.last_usage information to decide which buffers are not needed at all and thus don't even need to be allocated/written (Scheduler.removed_buffers tracks this). xSchedulerNode.last_usage information can only be computed once the order of all the nodes have been decided. But each fusion pass (`fuse_nodes_once`) can also change node orders. So we know the final node orders only after all the fusions have completed. That blocks us from doing codegen during fusion (before all fusion are done).
Here I just show the above with a chain of dependencies to make it easier to understand (a -> b means a depends on b, or b has to happen before a):
```
benchmark one fusion decision -> codegen -> xSchedulerNode.last_usage -> node order -> all fusions have completed
```
Actually we only need to decide if a buffer has only local usages (if yes, it's a candidate for removing). This can be decided if we know what are all the users for each buffer. We can avoid using xSchedulerNode.last_usage in this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107320
Approved by: https://github.com/peterbell10, https://github.com/jansel