pytorch
f71a0dae - Use faulthandler to dump traceback of timed out processes in unit tests. (#54818)

Commit
3 years ago
Use faulthandler to dump traceback of timed out processes in unit tests. (#54818)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54818

Several flaky tests fail due to some sort of timeout, and it isn't clear from the error message in CI where exactly each process is stuck. In this PR, I've added a mechanism to dump the entire Python traceback of all Python threads when we encounter a timeout.

Example traceback:

```
Process 3 timed out with traceback:

Current thread 0x00007ff3363ff700 (most recent call first):
  File "torch/testing/_internal/common_distributed.py", line 373 in _event_listener
  File "threading.py", line 870 in run
  File "threading.py", line 932 in _bootstrap_inner
  File "threading.py", line 890 in _bootstrap

Thread 0x00007ff406132180 (most recent call first):
  File "torch/distributed/distributed_c10d.py", line 2477 in barrier
  File "torch/testing/_internal/distributed/rpc/rpc_test.py", line 838 in test_reinit
  File "torch/testing/_internal/dist_utils.py", line 90 in new_test_method
  File "torch/testing/_internal/common_distributed.py", line 292 in wrapper
  File "torch/testing/_internal/common_distributed.py", line 409 in run_test
  File "torch/testing/_internal/common_distributed.py", line 393 in _run
  File "multiprocessing/process.py", line 108 in run
  File "multiprocessing/process.py", line 315 in _bootstrap
  File "multiprocessing/popen_fork.py", line 75 in _launch
  File "multiprocessing/popen_fork.py", line 19 in __init__
  File "multiprocessing/context.py", line 277 in _Popen
  File "multiprocessing/process.py", line 121 in start
```

ghstack-source-id: 125323810

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D27378764

fbshipit-source-id: 661c009a5458c724f004aa83de9347a4bc03b63e
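As a rough illustration of the idea described in the summary (not the code from this PR), the sketch below uses the standard library's `faulthandler.dump_traceback` to capture the stacks of all Python threads once a timeout expires. The helper names `dump_all_thread_tracebacks` and `run_with_timeout` are hypothetical, and the sketch watches a thread inside a single process, whereas the tests in the PR run in separate processes.

```python
import faulthandler
import tempfile
import threading


def dump_all_thread_tracebacks() -> str:
    """Return the Python traceback of every thread in this process as text.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    # faulthandler writes straight to a file descriptor, so it needs a real
    # file (something with fileno()), not an in-memory io.StringIO buffer.
    with tempfile.TemporaryFile(mode="w+") as tmp:
        faulthandler.dump_traceback(file=tmp, all_threads=True)
        tmp.seek(0)
        return tmp.read()


def run_with_timeout(target, timeout_s: float) -> str:
    """Run `target` in a daemon thread and wait up to `timeout_s` seconds.

    Returns the tracebacks of all threads if `target` is still running
    after the timeout, or an empty string if it finished in time.
    """
    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        return dump_all_thread_tracebacks()
    return ""
```

Calling `run_with_timeout` with a function that blocks indefinitely returns a report in the same "most recent call first" format as the example traceback above, which shows the frame where each thread is stuck.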