[Inductor] Enable fusion of mutation ops in narrow cases (#94110)
Currently, Inductor never fuses mutation ops: it introduces a `StarDep` that prevents fusion with any upstream reader, which ensures that the kernel mutating the buffer executes after those readers.
This results in cases like [this](https://gist.github.com/mlazos/3dcfd416033b3459ffea43cb91c117c9) where, even though all of the other readers have been fused into a single kernel, the `copy_` is left in a kernel of its own.
This PR introduces `WeakDep` and a pass that runs after each fusion. The pass checks whether the mutating kernel has other dependencies on the upstream fused node that already guarantee it is ordered after the prior readers. If so, the `WeakDep` is pruned and the kernel performing the mutation can be fused with the upstream kernel. This allows Inductor to fuse the epilogue `copy_`s that functionalization introduces on inference graphs.
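
The pruning rule can be illustrated with a small toy model. This is a sketch under stated assumptions, not the scheduler code from this PR: the `MemoryDep` and `WeakDep` classes here are simplified stand-ins, and `prune_redundant_weak_deps`, `buf_to_fused_node`, and the buffer and node names are hypothetical. It assumes a `WeakDep` is redundant once the fused node that contains its target is also the target of a real data dependency of the same kernel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryDep:
    """Toy stand-in: a real data dependency on buffer `name`."""
    name: str


@dataclass(frozen=True)
class WeakDep:
    """Toy stand-in: ordering-only dependency on the reader that produces `name`."""
    name: str


def prune_redundant_weak_deps(deps, buf_to_fused_node):
    """Hypothetical helper: drop WeakDeps already implied by a data dependency.

    `buf_to_fused_node` maps each buffer name to the (possibly fused) node
    that produces it.
    """
    # Fused nodes that this kernel must already run after because it reads
    # one of their outputs.
    strong_targets = {
        buf_to_fused_node[d.name] for d in deps if isinstance(d, MemoryDep)
    }
    return {
        d
        for d in deps
        if not (
            isinstance(d, WeakDep)
            and buf_to_fused_node[d.name] in strong_targets
        )
    }


# The copy_ reads buf1 and mutates an input that the producer of buf0 reads,
# so it carries a data dependency on buf1 and a weak dependency on buf0.
copy_deps = {MemoryDep("buf1"), WeakDep("buf0")}

# Before fusion, buf0 and buf1 come from different nodes: the WeakDep stays.
before_fusion = prune_redundant_weak_deps(
    copy_deps, {"buf0": "op0", "buf1": "op1"}
)

# After op0 and op1 are fused, the data dependency already orders the copy_
# after the reader, so the WeakDep can be pruned.
after_fusion = prune_redundant_weak_deps(
    copy_deps, {"buf0": "op0_op1", "buf1": "op0_op1"}
)
```

In this model, once the `WeakDep` is gone, the only remaining dependency of the `copy_` points at the fused upstream node, which is the situation in which the mutation can be fused into that node.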
[before code](https://gist.github.com/mlazos/3369a11dfd1b5cf5bb255313b710ef5b)
[after code](https://gist.github.com/mlazos/1005d8aeeba56e3a3e1b70cd77773c53)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94110
Approved by: https://github.com/jansel