[TorchElastic] Handle SystemExit with code == 0 (#119697)
Summary:
Fix for a case where torchrun with the --run-path option fails to exit if the script exits with a non-error status code (e.g. sys.exit(0)).
When the script exits with an error code, run-path correctly detects the error and fails when calling spawn.join(). In the non-error case, however, the current behavior is to check the return value of the operation, and a script that exits via sys.exit(0) never produces one. The fix is to return None in that case so that our MP code detects the exit.
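A minimal sketch of the idea described above, not the actual patch: the names call_and_report, exits_cleanly and exits_with_error are hypothetical, and a plain queue.Queue stands in for whatever channel the MP code uses to collect return values. It only illustrates treating SystemExit with code == 0 as a normal return of None while letting error exits propagate.

```python
import queue
import sys


def call_and_report(fn, results):
    """Run fn and report its return value on results (hypothetical helper)."""
    try:
        ret = fn()
    except SystemExit as exc:
        # Only code == 0 is treated as a clean exit here, matching the PR
        # title; any other code is re-raised so the failure stays visible.
        if exc.code != 0:
            raise
        ret = None
    # A clean sys.exit(0) is reported exactly like a function returning None.
    results.put(ret)


def exits_cleanly():
    sys.exit(0)


def exits_with_error():
    sys.exit(1)


results = queue.Queue()

call_and_report(exits_cleanly, results)
clean_result = results.get_nowait()

try:
    call_and_report(exits_with_error, results)
    error_propagated = False
except SystemExit:
    error_propagated = True
```

With this shape, the waiting side always receives a value for a successful worker, so it does not need to distinguish "returned None" from "called sys.exit(0)".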
Test Plan:
cat /tmp/script.py
~~~
import sys

def main():
    exit_code = 1
    if len(sys.argv) > 1:
        exit_code = int(sys.argv[1])
    sys.exit(exit_code)

if __name__ == "__main__":
    main()
~~~
Exit code 0 (prior behavior: never exits; with this fix torchrun exits with status 0):
torchrun --run-path /tmp/script.py 0
~~~
[2024-02-12 09:20:57,523] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:20:58,980] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
(conda:pytorch) ➜ workspace echo $?
0
~~~
Existing behavior for non-zero exit code still works:
torchrun --run-path /tmp/script.py
~~~
(conda:pytorch) ➜ workspace torchrun --run-path /tmp/script.py
[2024-02-12 09:16:20,667] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:16:22,197] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 64668) of fn: run_script_path (start_method: spawn)
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] Traceback (most recent call last):
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] File "/Users/kurman/workspace/pytorch/torch/distributed/elastic/multiprocessing/api.py", line 441, in _poll
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] self._pc.join(-1)
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] File "/Users/kurman/workspace/pytorch/torch/multiprocessing/spawn.py", line 177, in join
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] raise ProcessExitedException(
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 1
Traceback (most recent call last):
File "/Users/kurman/miniconda3/envs/pytorch/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kurman/workspace/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/Users/kurman/workspace/pytorch/torch/distributed/run.py", line 812, in main
run(args)
File "/Users/kurman/workspace/pytorch/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/Users/kurman/workspace/pytorch/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kurman/workspace/pytorch/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_script_path FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-02-12_09:16:25
host : kurman-mbp.dhcp.thefacebook.com
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 64668)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
~~~
Differential Revision: D53653874
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119697
Approved by: https://github.com/wconstab