ddp experiments: run all measurements on the same allocation (#1245)
Summary:
Previously, we would launch a different slurm job for each measurement, e.g.:
1) resnet50 w/ inductor + graph breaks, 2 nodes
2) resnet50 w/ inductor + NO graph breaks, 2 nodes
3) ...
But there's a lot of variation between different nodes - possibly due to network topology, etc. These slurm jobs would often each be submitted to a different set of nodes, which would add a lot of noise to the data.
So instead, with this PR, we do the following:
* allocate enough nodes to run all the measurements
* launch (8 * max_nodes) jobs, and provide a list of measurements to each of the jobs
* in each job, we iterate through the list of measurements, spawning a new process to run each one.
* Once the new process exits, we synchronize via a barrier implemented with torch.distributed's FileStore.
* Note that if we have, say, 8 nodes, and one measurement only requires 4 nodes, then the other 4 nodes won't launch any jobs and will just sit idle waiting on the barrier.
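The per-job loop above can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: the function name, its arguments, and the `barrier_store` filename are all assumptions; only the FileStore-based barrier pattern itself (each job checks in with `add`, the last one sets a "done" key, everyone waits on it) reflects what the summary describes.

```python
import subprocess
import torch.distributed as dist

def run_measurements(job_dir, world_size, measurements):
    # Hypothetical sketch: all jobs open the same file in the shared
    # --job_dir, which backs the FileStore used for the barrier.
    store = dist.FileStore(f"{job_dir}/barrier_store", world_size)
    for i, cmd in enumerate(measurements):
        # `cmd` stands in for the command line of one measurement;
        # each measurement runs in a fresh subprocess.
        subprocess.run(cmd, check=False)
        # Barrier: every job checks in; the last arrival releases the rest.
        key = f"measurement_{i}"
        if store.add(key, 1) == world_size:
            store.set(f"{key}_done", "1")
        store.wait([f"{key}_done"])
```

Idle jobs (e.g. the extra 4 nodes in the example above) would skip the subprocess step but still participate in the barrier, so no job runs ahead.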
Typical usage:
```
python userbenchmark/ddp_experiments/__init__.py --job_dir /full/path/to/shared/directory
```
The results will then be dumped in `/full/path/to/shared/directory`. From there you can use `userbenchmark/ddp_experiments/parse_ddp.py` to view them (follow-up PR coming soon). Other options include:
```
--repeat [n] : repeats each experiment n times, so you can reduce the effect of any transient load on the machines or network
--profile True : turn on the profiler (note that for inductor, you need to turn off cuda graphs for the profiler to work)
--exclude node1,node2 : exclude a comma-separated list of nodes from the slurm allocation
--timeout 600 : timeout, in minutes
```
Pull Request resolved: https://github.com/pytorch/benchmark/pull/1245
Reviewed By: wconstab
Differential Revision: D40452953
Pulled By: davidberard98
fbshipit-source-id: 52ed56642d2bf335312fb792dbda07d8960047e9