[distributed] quicker exit in the case of failed tests in distributed (#34150)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34150
In the distributed setting we commonly have tests in which there are errors where one process
exits but the other do not (since they are for example waiting for work from
the process that exited). Currently, when this situation happens we do not
handle this well, and wait for process 0 to timeout. This results in wasted
time waiting for test errors and a less helpful "Process 0 timed out..." error
message when the error was actually something else.
This diff fixes the issue by checking for exited subprocesses and terminating
the test when we see a subprocess that has exited uncleanly. We still enforce
timeouts and return when all processes have exited cleantly in the happy path.
ghstack-source-id: 99921462
Test Plan:
All distributed tests + tested by writing tests that should trigger
the unclean subprocess detection, and verified that we exit quickly instead of
waiting for the entire timeout.
Differential Revision: D20231032
fbshipit-source-id: 3e0d4a20925b7d1098ec4c40ffcc66845425dd62