Fix issue where the input/output buffer of a functional collective (e.g. all_reduce / all_gather) is incorrectly reused later (#108811)
For this program:
```python
def func(a, *, tag, ranks, group_size):
    ar = torch.ops.c10d_functional.all_reduce(a, "sum", tag, ranks, group_size)
    ar = torch.ops.c10d_functional.wait_tensor(ar)
    c = torch.relu(a)
    # c = a
    d = torch.matmul(c, c)
    e = d + ar
    return (e,)
```
the generated code is:
```python
def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (4, 4), (4, 1))
    with torch.cuda._DeviceGuard(1):
        torch.cuda.set_device(1)  # no-op to ensure context
        buf0 = empty_strided((4, 4), (4, 1), device='cuda', dtype=torch.float32)
        buf0.copy_(arg0_1)  # no reuse
        buf1_pg = c10d._find_or_create_pg_by_ranks_and_tag('', [0, 1], 2)
        buf1 = buf0
        buf1_work = dist.all_reduce(buf1, async_op=True, group=buf1_pg, op=fun_col_impl._str_to_reduce_op('sum'))
        fun_col_impl._register_tensor_work(buf1, buf1_work)
        del buf1
        buf0 = _wait_tensor(buf0)
        buf2 = buf0
        buf3 = buf0; del buf0  # reuse
        # Source Nodes: [relu], Original ATen: [aten.relu]
        stream1 = get_cuda_stream(1)
        triton_poi_fused_relu_0.run(arg0_1, buf3, 16, grid=grid(16), stream=stream1)
        del arg0_1
        buf4 = empty_strided((4, 4), (4, 1), device='cuda', dtype=torch.float32)
        # Source Nodes: [add, relu], Original ATen: [aten.add, aten.relu]
        extern_kernels.addmm(buf2, buf3, buf3, alpha=1, beta=1, out=buf4)
        return (buf4, )
```
Notice that the all-reduce buffer (`buf1`, an alias of `buf0`, whose reduced contents are still live via `buf2`) is incorrectly recycled as `buf3`, the output buffer of the triton kernel `triton_poi_fused_relu_0`. The relu result overwrites the all-reduce result, so the final `addmm` reads `relu(a)` where eager mode would read `ar`.
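For reference, here is a minimal eager-mode sketch of the intended semantics (assuming a default process group is already initialized; a blocking `dist.all_reduce` on a clone stands in for the functional `all_reduce` + `wait_tensor` pair):
```python
import torch
import torch.distributed as dist

def eager_reference(a):
    # Functional collectives reduce a *copy*; the input `a` is left intact.
    ar = a.clone()
    dist.all_reduce(ar)     # blocking, so no separate wait is needed
    c = torch.relu(a)       # must see the original `a`
    d = torch.matmul(c, c)
    return (d + ar,)        # the buggy generated code computes d + relu(a) instead
```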
In general, we should make it so that Inductor doesn't try to reuse the input buffer of an in-place functional collective, since the collective may still be reading or writing that storage asynchronously and later ops may still alias its contents (see the sketch below).
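One way to express that constraint (a hypothetical sketch, not the actual change in this PR; `free_buffers` and `pending_collective_buffers` are illustrative names, not Inductor internals):
```python
# Hypothetical reuse guard: skip any storage handed to an in-flight collective.
def pick_reusable_buffer(free_buffers, pending_collective_buffers, size, stride):
    for buf in free_buffers:
        if buf.name in pending_collective_buffers:
            # The collective may still be reading/writing this storage
            # asynchronously, so never recycle it for another kernel's output.
            continue
        if buf.size == size and buf.stride == stride:
            free_buffers.remove(buf)
            return buf
    return None  # caller falls back to a fresh allocation
```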
We have a similar problem for the output buffers of out-of-place functional collectives; see https://github.com/pytorch/pytorch/issues/108780#issuecomment-1714921994.
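The hazard there is symmetric, as this illustrative fragment shows (assuming the legacy `c10d_functional.all_gather_into_tensor` overload with `tag`/`ranks`/`group_size` arguments): the collective writes its output buffer asynchronously, so that storage must not be recycled before the wait completes.
```python
# Illustrative only: the *output* of an out-of-place collective is written
# asynchronously, so its storage must stay untouched until wait_tensor returns.
out = torch.ops.c10d_functional.all_gather_into_tensor(a, tag, ranks, group_size)
# Recycling `out`'s storage for another kernel here would race with the
# in-flight NCCL write.
out = torch.ops.c10d_functional.wait_tensor(out)
```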
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108811
Approved by: https://github.com/Chillee, https://github.com/wconstab