pytorch
9bfb91b5 - Fix possible deadlock in _wait_all_workers (#39535)

Commit

4 years ago

Fix possible deadlock in _wait_all_workers (#39535)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39535

This is my understanding of what could happen: on workerN (N != 0), `_wait_all_workers_sequence_id_to_states`, which is a `defaultdict`, is accessed from two places: once in the body of `_wait_all_workers` (by the "main thread" of workerN) and once in `_set_proceed_shutdown_signal`, which worker0 invokes through an RPC call. I think the two could race, accessing `_wait_all_workers_sequence_id_to_states` at the same time and thus creating two separate copies of `WaitAllWorkersStates`. One thread would then wait on the event of one copy while the other thread set the event of the other copy. This led to a deadlock, as the main thread would end up waiting forever.

ghstack-source-id: 105283327

Test Plan:
I added additional logging in those functions and ran a stress test of the RPC test suite. Based on the logs, I suspected that this could be the issue, so I fixed it, reran the stress test, and didn't see the bug anymore. This is admittedly not very convincing evidence, as I may just have been lucky that second time...

Differential Revision: D21889752

fbshipit-source-id: 05ec710bd2930313e1480ae896b4b2f5f503aa17
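The race described in the summary can be illustrated with a minimal, standalone sketch. It is not the actual patch: `_states`, `_states_lock`, `get_states`, `wait_for_proceed`, and `set_proceed_signal` are hypothetical names chosen for illustration, and the `WaitAllWorkersStates` class here is a simplified stand-in. The idea is only that guarding every lookup in the `defaultdict` with one lock should ensure the waiting thread and the signalling thread end up with the same state object.

```python
import threading
from collections import defaultdict


class WaitAllWorkersStates:
    """Simplified stand-in: holds the event the main thread waits on."""

    def __init__(self):
        self.proceed_signal = threading.Event()


# Hypothetical names; they mirror the commit message, not the real module.
_states = defaultdict(WaitAllWorkersStates)
_states_lock = threading.Lock()


def get_states(sequence_id):
    # Without the lock, two threads could both miss on the key, each run the
    # default factory, and end up holding different objects.
    with _states_lock:
        return _states[sequence_id]


def wait_for_proceed(sequence_id, timeout):
    # "Main thread" side: wait on the shared event, outside the lock.
    return get_states(sequence_id).proceed_signal.wait(timeout)


def set_proceed_signal(sequence_id):
    # RPC-handler side: set the event on the same shared object.
    get_states(sequence_id).proceed_signal.set()


def demo():
    results = []
    waiter = threading.Thread(
        target=lambda: results.append(wait_for_proceed(0, 5.0))
    )
    setter = threading.Thread(target=set_proceed_signal, args=(0,))
    waiter.start()
    setter.start()
    waiter.join()
    setter.join()
    return results
```

Only the dictionary lookup is held under the lock; the wait itself happens outside it, so the signalling thread is never blocked by the waiter.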

Author

Committer

facebook-github-bot

Parents

8a6914dd
