pytorch
0417da22 - Set a timeout value when testing multiprocess DataLoader (#91476)

Commit
2 years ago
Set a timeout value when testing multiprocess DataLoader (#91476) Setting a timeout value when testing multiprocess DataLoader to prevent ASAN jobs timing out after 4 hours. We are seeing multiple timeout issue running ASAN tests on HUD https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=50&name_filter=asan for examples * Without mem leak check enabled https://github.com/pytorch/pytorch/actions/runs/3794216079/jobs/6455118197 * With mem leak check https://github.com/pytorch/pytorch/actions/runs/3792743994/jobs/6449356306 Looking a bit closer into the test, the hanging happens when multiprocess DataLoader is used in `test_utils`. Here is the snapshot of those processes when I log into the hang runner: ``` UID PID PPID C STIME TTY TIME CMD jenkins 1 0 0 Dec28 pts/0 00:00:00 bash jenkins 8 0 0 Dec28 pts/1 00:00:00 sh -c pip install dist/torch-2.0.0a0+git97db9fd-cp37-cp37m-linux_x86_64.whl[opt-einsum] && .jenkins/pytorch/test.sh jenkins 20 8 0 Dec28 pts/1 00:00:00 /bin/bash .jenkins/pytorch/test.sh jenkins 764 20 0 Dec28 pts/1 00:00:07 python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --shard 5 5 --verbose jenkins 788 764 0 Dec28 pts/1 00:00:00 /opt/conda/bin/python -c from multiprocessing.semaphore_tracker import main;main(6) jenkins 3743 764 0 Dec28 pts/1 00:00:05 /opt/conda/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=7, pipe_handle=11) --multiprocessing-fork jenkins 3766 3743 0 Dec28 pts/1 00:00:06 /opt/conda/bin/python -bb test_utils.py -v --import-slow-tests --import-disabled-tests jenkins 3878 3766 0 Dec28 pts/1 00:00:06 /opt/conda/bin/python -bb test_utils.py -v --import-slow-tests --import-disabled-tests jenkins 3879 3766 0 Dec28 pts/1 00:00:00 /opt/conda/bin/python -bb test_utils.py -v --import-slow-tests --import-disabled-tests jenkins 3880 3766 0 Dec28 pts/1 00:00:00 /opt/conda/bin/python -bb test_utils.py -v --import-slow-tests --import-disabled-tests jenkins 3881 3766 0 Dec28 pts/1 00:00:00 /opt/conda/bin/python -bb test_utils.py -v --import-slow-tests --import-disabled-tests jenkins 3893 0 0 01:45 pts/2 00:00:00 /bin/bash jenkins 3904 3893 0 01:46 pts/2 00:00:00 ps -ef ``` The specific hanging test was `test_random_seed` which spawned 4 subprocesses to load data. After I killed one of them, the test could continue and printed the following stacktrace: ``` test_random_seed (__main__.TestDataLoaderUtils) ... [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) ERROR (9345.840s) test_random_seed (__main__.TestDataLoaderUtils) ... test_random_seed errored - num_retries_left: 3 Traceback (most recent call last): File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1134, in _try_get_data data = self._data_queue.get(timeout=timeout) File "/opt/conda/lib/python3.7/multiprocessing/queues.py", line 104, in get if not self._poll(timeout): File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 257, in poll return self._poll(timeout) File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 414, in _poll r = wait([self], timeout) File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 921, in wait ready = selector.select(timeout) File "/opt/conda/lib/python3.7/selectors.py", line 415, in select fd_event_list = self._selector.poll(timeout) File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler _error_if_any_worker_fails() RuntimeError: DataLoader worker (pid 3878) is killed by signal: Terminated. The above exception was the direct cause of the following exception: Traceback (most recent call last): File "test_utils.py", line 469, in test_random_seed x2 = run() File "test_utils.py", line 464, in run return next(iter(dataloader)) File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 635, in __next__ data = self._next_data() File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1330, in _next_data idx, data = self._get_data() File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1296, in _get_data success, data = self._try_get_data() File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1147, in _try_get_data raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e RuntimeError: DataLoader worker (pid(s) 3878) exited unexpectedly [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads) ok (0.137s) ``` This doesn't fix the issue which I'll need to follow up to see why they hang. However, this should allow the test to terminate gracefully and report errors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/91476 Approved by: https://github.com/kit1980
Author
Committer
Parents
Loading