f4ffe99d - Fix flaky rref timeout test (#40141)

Fix flaky rref timeout test (#40141)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40141

This rref timeout test could be flaky because we could end up processing `RRefUserDelete` messages on the owner node before processing the `to_here()` message. This would result in a hang in `ProcessGroupAgent::sync()` that eventually results in a timeout. The rough sequence of what happens is:

0) Node 0 creates an RRef on node 1 with an rpc.remote() call.
1) rref.to_here() is called with a timeout. Because of delay injection, the processing of this message can be delayed (this is also technically possible in applications without delay injection).
2) At some point, the callbacks corresponding to rpc.remote() run and confirm the rref, adding it as a confirmed user.
3) RPC shutdown starts, as part of which we send out RRef user deletes. In this case, node 0 sends an RRef user delete to node 1, and node 1 removes the owner from the `owners_` field.
4) The `to_here()` message is finally processed by node 1. But since we have deleted the owner, while processing this message we create a future that will complete when the owner exists (this is to account for the case of `to_here()` arriving before the rpc.remote() message). That future will never complete, since the owner has already been deleted, so we hang indefinitely.

As a workaround for now, we can force `to_here()` to run before RPC shutdown by adding a blocking `to_here()` call with no timeout. A more robust, longer-term fix would be to detect whether an owner has previously been deleted (such as by an `RRefUserDelete`). Then we would know that the future corresponding to owner creation on the remote end will never complete, and we could error out when processing the `to_here()`.

ghstack-source-id: 106036796
Differential Revision: D22084735
fbshipit-source-id: fe7265a4fe201c4d6d2f480f64fe085cd59dbfb2
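For context, a minimal sketch of the workaround described above, using the public torch.distributed.rpc API (rpc.remote, RRef.to_here, rpc.shutdown). The worker name and the slow_add helper are hypothetical, and RPC initialization is assumed to be handled elsewhere (e.g. by the test harness); this is an illustration of the ordering fix, not the actual test code:

```python
import torch
import torch.distributed.rpc as rpc

# Hypothetical remote function, used only for illustration.
def slow_add(x, y):
    return x + y

def run_caller():
    # Assumes rpc.init_rpc() has already been called on both workers;
    # "worker1" is a hypothetical peer name.

    # Step 0: create an RRef owned by the remote node.
    rref = rpc.remote("worker1", slow_add,
                      args=(torch.ones(2), torch.ones(2)))

    # Steps 1-4: under delay injection this fetch can reach the owner
    # only after shutdown's RRefUserDelete, which used to hang.
    try:
        rref.to_here(timeout=0.01)
    except RuntimeError:
        pass  # expected timeout in the flaky scenario

    # Workaround: a blocking to_here() with no timeout forces the owner
    # to process the fetch before RPC shutdown deletes the owner entry.
    rref.to_here()

    rpc.shutdown()
```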