Pass child error to parent in distributed tests. (#52632)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52632
Distributed tests run in a multiprocessing environment, where a parent
process drives the tests through several child processes. As a result, when a
child process fails the parent only prints the following:
```
Process 0 exited with error code 10
```
The child process also logs its own exception, but it is cumbersome to go
through the logs and track it down.
To alleviate this, I've added a pipe for each child process. The child writes
its error to the pipe before exiting, and the parent reads the error from that
child's pipe and displays it.
The new output printed by the parent is as follows:
```
> RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
  File "torch/testing/_internal/common_distributed.py", line 361, in _run
    getattr(self, test_name)()
  File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
    fn()
  File "test_c10d.py", line 789, in test_broadcast_checks
    pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1
Process 1 exited with error code 10 and exception:
Traceback (most recent call last):
  File "torch/testing/_internal/common_distributed.py", line 361, in _run
    getattr(self, test_name)()
  File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
    fn()
  File "test_c10d.py", line 789, in test_broadcast_checks
    pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1
Process 2 exited with error code 10 and exception:
Traceback (most recent call last):
  File "torch/testing/_internal/common_distributed.py", line 361, in _run
    getattr(self, test_name)()
  File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
    fn()
  File "test_c10d.py", line 789, in test_broadcast_checks
    pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1
Process 3 exited with error code 10 and exception:
Traceback (most recent call last):
  File "torch/testing/_internal/common_distributed.py", line 361, in _run
    getattr(self, test_name)()
  File "torch/testing/_internal/common_distributed.py", line 288, in wrapper
    fn()
  File "test_c10d.py", line 789, in test_broadcast_checks
    pg.broadcast([t1], opts)
ValueError: ProcessGroupGloo::broadcast: invalid root rank: -1
```
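For readers unfamiliar with the pattern, the pipe-based error reporting described above can be sketched roughly as follows. This is a minimal standalone illustration, not the code in `common_distributed.py`: the names `_failing_test`, `_child_main`, and `run_children` are hypothetical, and the `fork` start method is an assumption made only to keep the sketch self-contained on POSIX systems.
```python
import multiprocessing as mp
import sys
import traceback

ERROR_EXIT_CODE = 10  # same exit code as in the output above


def _failing_test(rank):
    # Hypothetical test body that always fails, standing in for a real test.
    raise ValueError(f"invalid root rank: -1 (rank {rank})")


def _child_main(rank, conn):
    try:
        _failing_test(rank)
    except Exception:
        # Write the formatted traceback to the pipe before exiting.
        conn.send(traceback.format_exc())
        conn.close()
        sys.exit(ERROR_EXIT_CODE)
    conn.close()


def run_children(world_size):
    # Assumption: "fork" start method (POSIX only) to keep the sketch simple.
    ctx = mp.get_context("fork")
    workers = []
    for rank in range(world_size):
        # One one-way pipe per child: the parent reads, the child writes.
        parent_conn, child_conn = ctx.Pipe(duplex=False)
        proc = ctx.Process(target=_child_main, args=(rank, child_conn))
        proc.start()
        child_conn.close()  # the parent keeps only the read end
        workers.append((rank, proc, parent_conn))

    errors = []
    for rank, proc, parent_conn in workers:
        try:
            detail = parent_conn.recv()  # blocks until a message or EOF
        except EOFError:
            detail = None  # child exited without reporting an exception
        proc.join()
        parent_conn.close()
        if proc.exitcode != 0:
            errors.append(
                f"Process {rank} exited with error code {proc.exitcode} "
                f"and exception:\n{detail}"
            )
    return errors


if __name__ == "__main__":
    # Print the combined per-process report the parent would surface.
    print("\n".join(run_children(2)))
```
Reading from the pipe before calling `join()` avoids a potential deadlock if a child ever writes more than the pipe buffer can hold.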
ghstack-source-id: 122273793
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D26589274
fbshipit-source-id: 7b7a71ec790b216a89db7c157377f426531349a5