benchmark
4051d29c - ddp_experiments script: add nccl-socket-ifname arg, fix timeout (#1489)

Commit
2 years ago
ddp_experiments script: add nccl-socket-ifname arg, fix timeout (#1489)

Summary:
Fixes: https://github.com/pytorch/benchmark/issues/1486.

The timeout flag was being set incorrectly. This fixes that.

The NCCL_SOCKET_IFNAME was not configurable on the command line. This adds a --nccl-socket-ifname arg.

Testing: manually tested with:
```
python userbenchmark/ddp_experiments/__init__.py --ngpus 8 --distributed ddp --nodes 1 --filter_models resnet50 --timeout 15 --nccl-socket-ifname asdf
```
* verified that the slurm job gets killed after ~15 minutes
* in the logs, I observe `NCCL INFO NCCL_SOCKET_IFNAME set to asdf` followed by `NCCL WARN Bootstrap : no socket interface found`

Pull Request resolved: https://github.com/pytorch/benchmark/pull/1489

Reviewed By: xuzhao9

Differential Revision: D44153156

Pulled By: davidberard98

fbshipit-source-id: 6a4d5911e8ab0ef0243c20d268e56f3247df091c
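
For readers without the diff at hand, here is a minimal sketch of how a `--timeout` value in minutes and a `--nccl-socket-ifname` argument might be parsed and passed on to a job's environment. It is not the code from this commit: the function names `build_parser` and `job_environment`, the default timeout of 120, and the help strings are all assumptions made for illustration. Only the two flag names and the `NCCL_SOCKET_IFNAME` environment variable come from the commit message.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only the two flag names are taken from the commit message.
    parser = argparse.ArgumentParser(description="DDP experiment launcher (illustrative sketch)")
    parser.add_argument(
        "--timeout",
        type=int,
        default=120,  # assumed default, not taken from the real script
        help="job timeout in minutes",
    )
    parser.add_argument(
        "--nccl-socket-ifname",
        type=str,
        default=None,
        help="network interface name to export as NCCL_SOCKET_IFNAME",
    )
    return parser


def job_environment(args: argparse.Namespace, base_env=None) -> dict:
    # Copy the base environment so the caller's mapping is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    # Export the variable only when the flag was given, leaving NCCL's
    # own interface selection in place otherwise.
    if args.nccl_socket_ifname is not None:
        env["NCCL_SOCKET_IFNAME"] = args.nccl_socket_ifname
    return env
```

Calling `build_parser().parse_args(["--timeout", "15", "--nccl-socket-ifname", "asdf"])` and passing the result to `job_environment` yields an environment with `NCCL_SOCKET_IFNAME` set to `asdf`, mirroring the manual test above. How the timeout value reaches the Slurm scheduler is left out here because the commit message does not describe it.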