Fix issue where the input/output buffer of a functional collective (e.g. all_reduce / all_gather) is incorrectly reused later (#108811)
For this program:
```python
def func(a, *, tag, ranks, group_size):
    ar = torch.ops.c10d_functional.all_reduce(a, "sum", tag, ranks, group_size)
    ar = torch.ops.c10d_functional.wait_tensor(ar)
    c = torch.relu(a)
    # c = a
    d = torch.matmul(c, c)
    e = d + ar
    return (e,)
```
the generated code is:
```python
def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (4, 4), (4, 1))
    with torch.cuda._DeviceGuard(1):
        torch.cuda.set_device(1)  # no-op to ensure context
        buf0 = empty_strided((4, 4), (4, 1), device='cuda', dtype=torch.float32)
        buf0.copy_(arg0_1)  # no reuse
        buf1_pg = c10d._find_or_create_pg_by_ranks_and_tag('', [0, 1], 2)
        buf1 = buf0
        buf1_work = dist.all_reduce(buf1, async_op=True, group=buf1_pg, op=fun_col_impl._str_to_reduce_op('sum'))
        fun_col_impl._register_tensor_work(buf1, buf1_work)
        del buf1
        buf0 = _wait_tensor(buf0)
        buf2 = buf0
        buf3 = buf0; del buf0  # reuse
        # Source Nodes: [relu], Original ATen: [aten.relu]
        stream1 = get_cuda_stream(1)
        triton_poi_fused_relu_0.run(arg0_1, buf3, 16, grid=grid(16), stream=stream1)
        del arg0_1
        buf4 = empty_strided((4, 4), (4, 1), device='cuda', dtype=torch.float32)
        # Source Nodes: [add, relu], Original ATen: [aten.add, aten.relu]
        extern_kernels.addmm(buf2, buf3, buf3, alpha=1, beta=1, out=buf4)
        return (buf4, )
```
Notice that the all-reduce buffer (`buf1`, an alias of `buf0`, whose reduced contents are still live via `buf2`) is incorrectly recycled as `buf3`, the output buffer of the triton kernel `triton_poi_fused_relu_0`. The relu result overwrites the all-reduce result, so the final `addmm` reads `relu(a)` where eager mode would read `ar`.
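For reference, here is a minimal eager-mode sketch of the intended semantics (assuming a default process group is already initialized; a blocking `dist.all_reduce` on a clone stands in for the functional `all_reduce` + `wait_tensor` pair):
```python
import torch
import torch.distributed as dist

def eager_reference(a):
    # Functional collectives reduce a *copy*; the input `a` is left intact.
    ar = a.clone()
    dist.all_reduce(ar)     # blocking, so no separate wait is needed
    c = torch.relu(a)       # must see the original `a`
    d = torch.matmul(c, c)
    return (d + ar,)        # the buggy generated code computes d + relu(a) instead
```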
In general, we should make it so that Inductor doesn't try to reuse the input buffer of an in-place functional collective, since the collective may still be reading or writing that storage asynchronously and later ops may still alias its contents (see the sketch below).
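One way to express that constraint (a hypothetical sketch, not the actual change in this PR; `free_buffers` and `pending_collective_buffers` are illustrative names, not Inductor internals):
```python
# Hypothetical reuse guard: skip any storage handed to an in-flight collective.
def pick_reusable_buffer(free_buffers, pending_collective_buffers, size, stride):
    for buf in free_buffers:
        if buf.name in pending_collective_buffers:
            # The collective may still be reading/writing this storage
            # asynchronously, so never recycle it for another kernel's output.
            continue
        if buf.size == size and buf.stride == stride:
            free_buffers.remove(buf)
            return buf
    return None  # caller falls back to a fresh allocation
```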
We have a similar problem for the output buffers of out-of-place functional collectives; see https://github.com/pytorch/pytorch/issues/108780#issuecomment-1714921994.
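The hazard there is symmetric, as this illustrative fragment shows (assuming the legacy `c10d_functional.all_gather_into_tensor` overload with `tag`/`ranks`/`group_size` arguments): the collective writes its output buffer asynchronously, so that storage must not be recycled before the wait completes.
```python
# Illustrative only: the *output* of an out-of-place collective is written
# asynchronously, so its storage must stay untouched until wait_tensor returns.
out = torch.ops.c10d_functional.all_gather_into_tensor(a, tag, ranks, group_size)
# Recycling `out`'s storage for another kernel here would race with the
# in-flight NCCL write.
out = torch.ops.c10d_functional.wait_tensor(out)
```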
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108811
Approved by: https://github.com/Chillee, https://github.com/wconstab