pytorch
f58cc4b4 - [RPC] Fix flaky test by waiting for async rref calls (#39012)

Commit
4 years ago
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39012

The `test_rref_context_debug_info` test was flaky with the TensorPipe agent, and I think the issue is the test itself.

What was happening is that on line 1826 the test was clearing a global variable on the remote side which was holding a rref. Even though the RPC call that unset the global variable was synchronous, the messages that the rref context needs to send around to delete that rref are asynchronous. Therefore, sometimes, when we reached line 1845 we saw the following check fail:
```
self.assertEqual(2, int(info["num_owner_rrefs"]))
```
because `num_owner_rrefs` was still 3, as the deletion hadn't yet been processed.

The only way I found to fix it is to add a synchronization step where we wait for all the futures from the rref context to complete. Since we must wait for this to happen on all workers, we synchronize with a barrier.

ghstack-source-id: 104810738

Test Plan: The test isn't flaky anymore.

Differential Revision: D21716070

fbshipit-source-id: e5a97e520c5b10b67c335abf2dc7187ee6227643
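The race and the fix described above can be illustrated with a small standard-library sketch. It is an analogy only, not the actual change in this commit and not PyTorch's RPC API: `ToyRRefContext`, `delete_rref_async`, `wait_all_pending`, and `run_demo` are hypothetical names, threads stand in for RPC workers, and a sleep stands in for the asynchronous deletion messages. Each worker triggers a deletion that completes later, waits for its own pending futures, then meets the others at a barrier before reading a peer's counter.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, wait

NUM_WORKERS = 3  # assumption: three workers, chosen only for the demo


class ToyRRefContext:
    """Hypothetical stand-in for an rref context; not PyTorch's API."""

    def __init__(self, initial_owner_rrefs):
        self.num_owner_rrefs = initial_owner_rrefs
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = []

    def delete_rref_async(self):
        # Returns immediately; the counter only drops once the
        # background "deletion message" has been processed.
        def _delete():
            time.sleep(0.05)  # simulated message latency
            with self._lock:
                self.num_owner_rrefs -= 1

        self._pending.append(self._pool.submit(_delete))

    def wait_all_pending(self):
        # Block until every deletion issued so far has completed.
        pending, self._pending = self._pending, []
        wait(pending)

    def shutdown(self):
        self._pool.shutdown()


def run_worker(rank, contexts, barrier, results):
    ctx = contexts[rank]
    ctx.delete_rref_async()  # "synchronous" call that triggers async cleanup
    ctx.wait_all_pending()   # step 1: drain this worker's own futures
    barrier.wait()           # step 2: make sure every worker has drained
    # Only now is it safe to inspect another worker's counter.
    peer = contexts[(rank + 1) % len(contexts)]
    results[rank] = peer.num_owner_rrefs


def run_demo():
    barrier = threading.Barrier(NUM_WORKERS)
    contexts = [ToyRRefContext(3) for _ in range(NUM_WORKERS)]
    results = [None] * NUM_WORKERS
    threads = [
        threading.Thread(target=run_worker, args=(rank, contexts, barrier, results))
        for rank in range(NUM_WORKERS)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    for ctx in contexts:
        ctx.shutdown()
    return results


if __name__ == "__main__":
    print(run_demo())
```

In this toy setup, dropping either the `wait_all_pending()` call or the `barrier.wait()` call lets a worker read a peer's counter while it may still be 3, which mirrors the intermittent assertion failure the commit message describes.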
Author
lw