pytorch
f665a7f8 - [pet] Set error code in reply file when child process is terminated by signals.

Commit
3 years ago
[pet] Set error code in reply file when child process is terminated by signals.

Summary:
Fill the reply file's error code with ProcessFailure's exitcode. This is necessary when the child process is terminated by a signal (e.g. SIGSEGV).

Test Plan:
- Buck test
```
buck test mode/dev-nosan pytorch/elastic/torchelastic/distributed/fb/test:launch_test
buck test mode/dev-nosan caffe2/torch/distributed/elastic/multiprocessing/errors/fb/test:error_handler_fb_test_needed_coverage
```
- TSM
```
fbpkg build -E torchelastic_distributed_sum
buck run mode/dev-nosan //pytorch/elastic/torchelastic/tsm/fb/cli:tsm -- run_ddp --scheduler mast --fbpkg torchelastic_distributed_sum:ecdf31f --nnodes 2 --nproc_per_node 2 --resource T1 --run_cfg hpcIdentity=oncall_dai_pet,hpcClusterUuid=MastNaoTestCluster main.pa
```
https://www.internalfb.com/mast/job/tsm_wilsonhong-torchelastic_distributed_sum_ef3fd8d3
- classy_vision
```
flow-cli canary pytorch.elastic.examples.classy_vision.main --entitlement gpu_prod --run-as-secure-group oncall_dai_pet --buck-target //fblearner/flow/projects/pytorch/elastic/examples:workflow
```
https://our.intern.facebook.com/intern/fblearner/details/263970380/?notif_channel=cli

Reviewed By: tierex

Differential Revision: D27512554

fbshipit-source-id: 903d25d96655085685f874113826d4627d9a79e4
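
For illustration only, here is a minimal Python sketch of the idea in the summary: a child process killed by a signal never gets the chance to write its own error details, so the parent records the exit code it observed. This is not the code from this commit. The helper names `describe_exit` and `write_reply_file` and the JSON layout (`message`, `errorCode`) are assumptions made for the example. It relies on the POSIX convention that Python reports a child killed by signal N with exit code -N.

```python
import json
import signal
import subprocess
import sys
import tempfile
from pathlib import Path


def describe_exit(exitcode: int) -> str:
    """Return a readable description of a child's exit code (hypothetical helper)."""
    if exitcode < 0:
        # On POSIX, Python reports death by signal N as exit code -N.
        try:
            name = signal.Signals(-exitcode).name
        except ValueError:
            name = f"signal {-exitcode}"
        return f"terminated by {name}"
    return f"exited with code {exitcode}"


def write_reply_file(path: Path, exitcode: int) -> dict:
    """Write a reply file carrying the exit code (hypothetical schema)."""
    reply = {"message": describe_exit(exitcode), "errorCode": exitcode}
    path.write_text(json.dumps(reply))
    return reply


if __name__ == "__main__":
    # Spawn a child that kills itself with SIGSEGV, so it cannot report
    # its own failure; the parent fills in the error code instead.
    child = subprocess.run(
        [sys.executable, "-c",
         "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]
    )
    reply_path = Path(tempfile.mkdtemp()) / "reply.json"
    print(write_reply_file(reply_path, child.returncode))
```

The point of the sketch is that the error code comes from the parent's view of the process (its exit code), which is the only information available when the child dies before writing anything itself.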
Author
Wilson Hong