[AOTDispatch] Return mutated inputs directly when keeping mutations (#120514)
Fixes #120242
The example from the issue now results in the following graph, which returns the result of `copy_` (the mutated input) directly:
```python
def forward(self, arg0_1, arg1_1):
    sin = torch.ops.aten.sin.default(arg0_1); arg0_1 = None
    copy_ = torch.ops.aten.copy_.default(arg1_1, sin); arg1_1 = sin = None
    return (copy_,)
```
The corresponding Inductor kernel writes its result straight into `arg1_1`, eliminating the intermediate buffer completely:
```python
def call(args):
    arg0_1, arg1_1 = args
    args.clear()
    assert_size_stride(arg0_1, (5, ), (1, ))
    assert_size_stride(arg1_1, (5, ), (1, ))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        # Source Nodes: [sin], Original ATen: [aten.sin]
        stream0 = get_raw_stream(0)
        triton_poi_fused_sin_0.run(arg0_1, arg1_1, 5, grid=grid(5), stream=stream0)
        del arg0_1
    return (arg1_1, )
```
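As a rough illustration of why returning `copy_` lets the output alias the mutated input, here is a toy model in plain Python. It is not PyTorch code: the `sin` and `copy_` helpers below are hypothetical stand-ins built on `array`, assumed only to mimic the in-place, return-the-destination behavior of `aten.copy_`.

```python
import math
from array import array


def sin(src):
    """Toy stand-in for aten.sin (hypothetical): allocates a new buffer."""
    return array("d", (math.sin(v) for v in src))


def copy_(dst, src):
    """Toy stand-in for aten.copy_ (hypothetical): overwrite dst in place
    and return dst itself, not a fresh buffer."""
    dst[:] = src
    return dst


def forward(arg0_1, arg1_1):
    # Same structure as the traced graph above, using the toy ops.
    s = sin(arg0_1)
    out = copy_(arg1_1, s)
    return (out,)


x = array("d", [0.0, 0.5, 1.0, 1.5, 2.0])
buf = array("d", [0.0] * 5)
(result,) = forward(x, buf)

# The returned value is the very same object as the mutated input.
print(result is buf)
```

Because the only consumer of the temporary `s` is the copy into `arg1_1`, and the graph output is `arg1_1` itself, a backend is free to compute the sine directly into that buffer, which is what the generated kernel call above does.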
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120514
Approved by: https://github.com/ezyang, https://github.com/oulgen, https://github.com/lezcano