pytorch
7bd2014e - [resubmit][rpc] per-RPC timeouts for rpc_sync and rpc_async (#34650)

Commit

4 years ago

[resubmit][rpc] per-RPC timeouts for rpc_sync and rpc_async (#34650)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34650

Resubmit of https://github.com/pytorch/pytorch/pull/33840, which was overly eager: it deleted a lot of code that we didn't want to get rid of yet (default timeout handling).

This PR adds an optional argument to `rpc_sync` and `rpc_async`, as well as `RpcAgent::send()`, that allows the user to specify a timeout for an individual RPC, overriding the default timeout. If the user does not specify this argument, the current default RPC timeout, given in the RPC constructor or set by `rpc.set_rpc_timeout()`, is used. Otherwise, the passed-in timeout is used.

This diff does not address:
1) Timeout support when `rpc.rpc_async` is called as a JIT operator. For this to work, we would need to change the logic in `register_distributed_ops` to pass this timeout to `rpcTorchscript`. A further issue is that TorchScript doesn't support the timedelta object. This will be done in a follow-up PR, as it requires a fair amount of changes to the argument parsing logic.
2) Per-RPC timeouts for internal messages or `rpc.remote()`. A follow-up diff will address the latter by raising the timeout error to the user at the earliest possible time, such as the next time the RRef is forked or `to_here` is called.

Added unit tests to confirm the current behavior.

ghstack-source-id: 102622601

Test Plan: Added unit tests in rpc_test

Differential Revision: D20376953

fbshipit-source-id: 9fb3f147520588308ab50dd33286255658d76d47
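
The fallback rule described in the summary (use the per-call timeout when one is passed, otherwise the currently set default) can be illustrated with a minimal pure-Python sketch. The class `RpcTimeoutPolicy` and its methods are hypothetical names made up for this illustration; they are not part of `torch.distributed.rpc` and do not reproduce the actual change to `RpcAgent::send()`. The use of `timedelta` is an assumption based on the commit message's mention of the timedelta object.

```python
from datetime import timedelta
from typing import Optional


class RpcTimeoutPolicy:
    """Hypothetical helper for illustration; not a PyTorch API."""

    def __init__(self, default_timeout: timedelta) -> None:
        # Default timeout, analogous to the one given at construction time.
        self._default_timeout = default_timeout

    def set_default(self, timeout: timedelta) -> None:
        # Analogous in spirit to changing the default after construction.
        self._default_timeout = timeout

    def resolve(self, per_call_timeout: Optional[timedelta] = None) -> timedelta:
        # No per-call value given: fall back to the current default.
        if per_call_timeout is None:
            return self._default_timeout
        # Otherwise the per-call value overrides the default for this call only.
        return per_call_timeout


policy = RpcTimeoutPolicy(default_timeout=timedelta(seconds=60))
default_used = policy.resolve()                       # falls back to the default
override_used = policy.resolve(timedelta(seconds=5))  # per-call override
policy.set_default(timedelta(seconds=30))
new_default_used = policy.resolve()                   # picks up the updated default
```

Using `None` as the "not specified" sentinel keeps the override scoped to a single call: the stored default is never modified by passing a per-call value.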

Author

rohan-varma

Committer

facebook-github-bot

Parents

b0ee6c70
