b08309ee - (torch/elastic) skip logging structured error info if error_file is not set (#73477)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73477
Resolves: https://github.com/pytorch/pytorch/issues/73465

This `log.error` call is unnecessary (and its output is not human-friendly) because we re-raise the same exception after recording it into an error_file (if one is present). Python then handles the error the way it handles any other uncaught exception and writes the trace info to the console. The extra logging produced duplicate error prints on the console, affecting all users whose schedulers do not set the `TORCHELASTIC_ERROR_FILE` env var when calling `torch.distributed.run`.

Test Plan:
Induce an error on the agent process with `kill -15 $AGENT_PID` while running:

```
python -m torch.distributed.run \
  --nproc_per_node 2 \
  --nnodes 1:1 \
  --rdzv_backend c10d \
  --rdzv_endpoint localhost:29500 \
  --monitor_interval 3 \
  test.py
```

Produces {F704936697}, in contrast to the duplicated error before: {F704936729}

Reviewed By: d4l3k

Differential Revision: D34501852

fbshipit-source-id: 14fed18a9664130980205007ff104ff15a5fd4f8
(cherry picked from commit 0b7c51ba8834f4a4a5376f585c0795cb43be6521)
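A minimal sketch of the behavior this commit describes, not the actual PyTorch diff: record the exception into the error file when `TORCHELASTIC_ERROR_FILE` is configured, then re-raise so Python prints the traceback exactly once. The `record` function here is a hypothetical stand-in for `torch.distributed.elastic.multiprocessing.errors.record`, and the JSON payload shape is an assumption for illustration.

```python
import functools
import json
import os
import traceback

def record(fn):
    """Hypothetical stand-in for torch.distributed.elastic.multiprocessing.errors.record."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            error_file = os.environ.get("TORCHELASTIC_ERROR_FILE")
            if error_file:
                # Structured error info is only useful when the scheduler has
                # requested it via TORCHELASTIC_ERROR_FILE. (Payload shape is
                # simplified here; the real error file carries more fields.)
                with open(error_file, "w") as f:
                    json.dump({"message": traceback.format_exc()}, f)
            # Before this change, an unconditional log.error(...) here printed
            # the same traceback a second time; now we simply re-raise and let
            # Python write the trace to the console once.
            raise
    return wrapper
```

For reference, the real decorator is applied to the training entrypoint launched via `torch.distributed.run`:

```python
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    ...  # training entrypoint

if __name__ == "__main__":
    main()
```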
Author: Kiuk Chung