Add option to log subprocess output to files in DDP launcher. (#33193)
Summary:
Closes https://github.com/pytorch/pytorch/issues/7134. This adds an option to log the output of each subprocess (each subprocess is training a network with DDP) to a file instead of the default stdout.
The motivation is that with N processes all writing to stdout, the interleaved output is hard to decipher; it is cleaner to log each process to a separate file.
To support this, we add an optional argument `--logdir` that redirects each subprocess's stdout to a file named "node_{}_local_rank_{}" in the given logging directory. With this enabled, none of the training processes write to the parent process's stdout; they write to the aforementioned files instead. If a user accidentally passes in something that is not a directory, we fall back to ignoring the argument.
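A minimal sketch of the redirection logic, assuming a hypothetical `launch_worker` helper; the names and structure below are illustrative, not the exact launcher internals:

```python
import os
import subprocess

def launch_worker(cmd, node_rank, local_rank, logdir=None):
    # Hypothetical helper: redirect a worker's stdout to a per-rank
    # log file when a valid logdir is given.
    stdout_handle = None
    if logdir is not None and os.path.isdir(logdir):
        log_path = os.path.join(
            logdir, "node_{}_local_rank_{}".format(node_rank, local_rank)
        )
        stdout_handle = open(log_path, "w")
    # If logdir is missing or not a directory, stdout_handle stays
    # None and the worker inherits the parent's stdout as before.
    return subprocess.Popen(cmd, stdout=stdout_handle)
```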
Tested by taking a training script at https://gist.github.com/rohan-varma/2ff1d6051440d2c18e96fe57904b55d9 and running `python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port="29500" --logdir test_logdir train.py`. This creates a directory `test_logdir` containing files "node_0_local_rank_0" and "node_0_local_rank_1", each holding the corresponding training process's stdout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33193
Reviewed By: gchanan
Differential Revision: D24496013
Pulled By: rohan-varma
fbshipit-source-id: 1d3264cba242290d43db736073e841bbb5cb9e68