Use faulthandler to dump tracebacks of timed-out processes in unit tests. (#54818)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54818
Several flaky tests fail because of a timeout, and the error message in CI
doesn't make clear where each process is stuck. In this PR, I've added a
mechanism to dump the full Python traceback of all Python threads in a process
when that process times out.
Example traceback:
```
Process 3 timed out with traceback:
Current thread 0x00007ff3363ff700 (most recent call first):
  File "torch/testing/_internal/common_distributed.py", line 373 in _event_listener
  File "threading.py", line 870 in run
  File "threading.py", line 932 in _bootstrap_inner
  File "threading.py", line 890 in _bootstrap

Thread 0x00007ff406132180 (most recent call first):
  File "torch/distributed/distributed_c10d.py", line 2477 in barrier
  File "torch/testing/_internal/distributed/rpc/rpc_test.py", line 838 in test_reinit
  File "torch/testing/_internal/dist_utils.py", line 90 in new_test_method
  File "torch/testing/_internal/common_distributed.py", line 292 in wrapper
  File "torch/testing/_internal/common_distributed.py", line 409 in run_test
  File "torch/testing/_internal/common_distributed.py", line 393 in _run
  File "multiprocessing/process.py", line 108 in run
  File "multiprocessing/process.py", line 315 in _bootstrap
  File "multiprocessing/popen_fork.py", line 75 in _launch
  File "multiprocessing/popen_fork.py", line 19 in __init__
  File "multiprocessing/context.py", line 277 in _Popen
  File "multiprocessing/process.py", line 121 in start
```
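For reference, a traceback like the one above can be produced with the standard
library alone. The snippet below is a minimal sketch, not the code in this PR:
the names `dump_all_thread_tracebacks` and `_blocked_worker` are hypothetical,
and it dumps the threads of the current process rather than wiring the dump
into the test harness's timeout handling.
```python
import faulthandler
import tempfile
import threading


def dump_all_thread_tracebacks() -> str:
    """Return the tracebacks of every Python thread in this process as text.

    Hypothetical helper for illustration only.
    """
    # faulthandler writes directly to a file descriptor, so it needs a real
    # file (an io.StringIO has no fileno() and will not work).
    with tempfile.TemporaryFile(mode="w+") as tmp:
        faulthandler.dump_traceback(file=tmp, all_threads=True)
        tmp.seek(0)
        return tmp.read()


def _blocked_worker(started: threading.Event, release: threading.Event) -> None:
    # Stand-in for a test that is stuck, e.g. waiting in a barrier.
    started.set()
    release.wait()


started = threading.Event()
release = threading.Event()
worker = threading.Thread(
    target=_blocked_worker, args=(started, release), daemon=True
)
worker.start()
started.wait()  # make sure the worker is inside _blocked_worker

report = dump_all_thread_tracebacks()

# Unblock the worker so the example exits cleanly.
release.set()
worker.join()

print(report)
```
The report contains one "(most recent call first)" section per thread, so the
frame where a stuck thread is waiting shows up at the top of its section.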
ghstack-source-id: 125323810
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D27378764
fbshipit-source-id: 661c009a5458c724f004aa83de9347a4bc03b63e