pytorch
b553e691 - [distributed] quicker exit in the case of failed tests in distributed (#34150)

Commit

4 years ago

[distributed] quicker exit in the case of failed tests in distributed (#34150)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34150

In the distributed setting, tests commonly fail in a way where one process exits but the others do not (for example, because they are waiting for work from the process that exited). Currently we do not handle this situation well: we wait for process 0 to time out. This wastes time waiting on test errors and produces a less helpful "Process 0 timed out..." error message when the actual error was something else.

This diff fixes the issue by checking for exited subprocesses and terminating the test when we see a subprocess that has exited uncleanly. We still enforce timeouts, and in the happy path we return when all processes have exited cleanly.

ghstack-source-id: 99921462

Test Plan: All distributed tests, plus tests written to trigger the unclean subprocess detection; verified that we exit quickly instead of waiting for the entire timeout.

Differential Revision: D20231032

fbshipit-source-id: 3e0d4a20925b7d1098ec4c40ffcc66845425dd62
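The polling behavior described in the summary can be illustrated with a minimal sketch. This is not the code from the diff: it uses only the standard library `subprocess` module, and the names `wait_for_workers`, `spawn_worker`, and `_terminate_all`, as well as the `(status, detail)` return shape, are hypothetical and chosen for illustration. The idea is to poll every subprocess, stop as soon as one has exited with a nonzero code, return normally once all have exited with code 0, and still enforce an overall timeout.

```python
import subprocess
import sys
import time


def spawn_worker(code):
    """Hypothetical helper: run a Python snippet in a child process."""
    return subprocess.Popen([sys.executable, "-c", code])


def _terminate_all(procs):
    """Terminate any process that is still running, then reap all of them."""
    for p in procs:
        if p.poll() is None:
            p.terminate()
    for p in procs:
        p.wait()


def wait_for_workers(procs, timeout_s, poll_interval_s=0.1):
    """Wait for all workers, exiting early if any worker exits uncleanly.

    Returns (status, detail), where status is "ok", "failed", or "timeout".
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # Popen.poll() returns None while the process is still running.
        codes = [p.poll() for p in procs]
        for rank, code in enumerate(codes):
            if code is not None and code != 0:
                # Unclean exit: stop the remaining workers right away
                # instead of waiting for the full timeout.
                _terminate_all(procs)
                return "failed", f"process {rank} exited with code {code}"
        if all(code == 0 for code in codes):
            return "ok", "all processes exited cleanly"
        time.sleep(poll_interval_s)
    # The overall timeout is still enforced.
    _terminate_all(procs)
    return "timeout", f"timed out after {timeout_s} seconds"


if __name__ == "__main__":
    # Worker 0 blocks as if waiting for work; worker 1 fails immediately.
    workers = [
        spawn_worker("import time; time.sleep(30)"),
        spawn_worker("import sys; sys.exit(3)"),
    ]
    print(wait_for_workers(workers, timeout_s=20))
```

In this sketch the failure is reported against the process that actually exited uncleanly, rather than as a timeout on process 0, which is the improvement in error reporting that the summary describes.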

Author

rohan-varma

Committer

facebook-github-bot

Parents

2cf576e9
