benchmark
d2ebc396 - ddp experiments: run all measurements on the same allocation (#1245)

Commit
3 years ago
Summary:
Previously, we would launch a different slurm job for each measurement, e.g.:
1) resnet50 w/ inductor + graph breaks, 2 node
2) resnet50 w/ inductor + NO graph breaks, 2 node
3) ...

But there's a lot of variation between different nodes - possibly due to network topology etc. These slurm jobs would often each be submitted to a different set of nodes, which would add a lot of noise to the data.

So instead, with this PR, we do the following:
* allocate enough nodes to run all the measurements
* launch (8 * max_nodes) jobs, and provide a list of measurements to each of the jobs
* in each job, we iterate through the list of measurements. For each measurement, we spawn a new process that runs the measurement.
* Once the new process exits, we synchronize via a barrier that's implemented via torch.distributed's FileStore.
* Note that if we have, say, 8 nodes, and one measurement only requires 4 nodes, then the other 4 nodes won't launch any jobs for that measurement and will just sit idle waiting on the barrier.

Typical usage:
```
python userbenchmark/ddp_experiments/__init__.py --job_dir /full/path/to/shared/directory
```
The results will then be dumped in `/full/path/to/shared/directory`. From there you can use `userbenchmark/ddp_experiments/parse_ddp.py` to view the results (follow up PR coming soon).

Other options include:
```
--repeat [n] : repeats the experiments n times each so you can reduce the effect of any transient load on the machines or network
--profile True : turn on the profiler (note that for inductor, you need to turn off cuda graphs for the profiler to work)
--exclude node1,node2 : exclude a comma-separated list of nodes from the slurm allocation.
--timeout 600 : timeout in minutes
```

Pull Request resolved: https://github.com/pytorch/benchmark/pull/1245

Reviewed By: wconstab

Differential Revision: D40452953

Pulled By: davidberard98

fbshipit-source-id: 52ed56642d2bf335312fb792dbda07d8960047e9
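The per-job loop described in the summary (iterate through the measurements, spawn a process for each one this job takes part in, then wait on a barrier) might look roughly like the sketch below. This is not the code from this PR. The names `file_barrier`, `run_job`, `spawn_measurement` and `gpus_per_node` are hypothetical, and the barrier is a plain-filesystem stand-in (one marker file per rank in a shared directory) rather than torch.distributed's FileStore, so that the sketch runs without torch installed.

```python
import subprocess
import sys
import tempfile
import time
from pathlib import Path


def file_barrier(barrier_dir, name, rank, world_size, timeout_s=60.0, poll_s=0.05):
    """Block until `world_size` ranks have checked in for barrier `name`.

    Stand-in for a FileStore-backed barrier: each rank drops a marker file
    into a shared directory and polls until all markers are present.
    """
    d = Path(barrier_dir) / name
    d.mkdir(parents=True, exist_ok=True)
    (d / f"rank_{rank}").touch()
    deadline = time.monotonic() + timeout_s
    while len(list(d.glob("rank_*"))) < world_size:
        if time.monotonic() > deadline:
            raise TimeoutError(f"barrier {name!r} timed out on rank {rank}")
        time.sleep(poll_s)


def spawn_measurement(name, rank):
    """Run one measurement in a fresh process (placeholder workload)."""
    code = "import sys; print('rank', sys.argv[1], 'ran', sys.argv[2])"
    cmd = [sys.executable, "-c", code, str(rank), name]
    return subprocess.run(cmd, check=True).returncode


def run_job(rank, world_size, gpus_per_node, measurements, barrier_dir,
            launch=spawn_measurement):
    """Walk the shared measurement list; return the names this rank ran.

    `measurements` is a list of (name, nodes_required) pairs. A rank whose
    node is not needed for a measurement skips the launch and goes straight
    to the barrier, i.e. it idles until the participating ranks finish.
    """
    ran = []
    for i, (name, nodes_required) in enumerate(measurements):
        if rank < nodes_required * gpus_per_node:
            launch(name, rank)  # returns once the child process has exited
            ran.append(name)
        # Every rank, busy or idle, synchronizes before the next measurement.
        file_barrier(barrier_dir, f"measurement_{i}", rank, world_size)
    return ran
```

Because every job synchronizes after every measurement, all measurements run back to back on the same set of nodes, which is what removes the node-to-node noise described above. The cost is that nodes not needed by a smaller measurement sit idle for its duration.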