Support DDP in the core model set (#1031)
Summary:
Example command 1 (eager mode, logs written to `logs_eager`):
```
python run_benchmark.py distributed --ngpus 8 --nodes 1 --model torchbenchmark.models.hf_Bert.Model --trainer torchbenchmark.util.distributed.core_model.trainer.Trainer --distributed ddp --job_dir $PWD/.userbenchmark/distributed/logs_eager
```
Output:
```
{
    "name": "distributed",
    "environ": {
        "pytorch_git_version": "5728ca13aef459e71cee062eb872ab217dfa5742"
    },
    "args": {
        "ngpus": 8,
        "nodes": 1,
        "timeout": 1440,
        "profiler": false,
        "partition": "train",
        "cluster": null,
        "job_dir": "/data/home/xzhao9/benchmark/.userbenchmark/distributed/logs_eager",
        "model": "torchbenchmark.models.hf_Bert.Model",
        "trainer": "torchbenchmark.util.distributed.core_model.trainer.Trainer",
        "distributed": "ddp",
        "dist_url": "file:///data/home/xzhao9/benchmark/.userbenchmark/distributed/logs_eager/f4960836279846a88e9bee2202fb226e_init",
        "output_dir": "/data/home/xzhao9/benchmark/.userbenchmark/distributed/logs_eager"
    },
    "metrics": {
        "0-latency_median": 375.4787841796875,
        "0-latency_stdev": 0.5074592125167986,
        "1-latency_median": 375.49486389160154,
        "1-latency_stdev": 0.5890401056068594,
        "2-latency_median": 375.4880035400391,
        "2-latency_stdev": 0.5707920820804092,
        "3-latency_median": 375.48769226074216,
        "3-latency_stdev": 0.5835954020419492,
        "4-latency_median": 375.4671112060547,
        "4-latency_stdev": 0.49707192777934556,
        "5-latency_median": 375.49219970703126,
        "5-latency_stdev": 0.5600655620421927,
        "6-latency_median": 375.4905609130859,
        "6-latency_stdev": 0.5482310737142803,
        "7-latency_median": 375.4790863037109,
        "7-latency_stdev": 0.5190043980938861
    }
}
```
Example command 2 (same setup with `--torchdynamo aot_autograd_speedup_strategy`):
```
python run_benchmark.py distributed --ngpus 8 --nodes 1 --model torchbenchmark.models.hf_Bert.Model --trainer torchbenchmark.util.distributed.core_model.trainer.Trainer --distributed ddp --torchdynamo aot_autograd_speedup_strategy --job_dir $PWD/.userbenchmark/distributed/logs_torchdynamo
```
Output:
```
{
    "name": "distributed",
    "environ": {
        "pytorch_git_version": "5728ca13aef459e71cee062eb872ab217dfa5742"
    },
    "args": {
        "ngpus": 8,
        "nodes": 1,
        "timeout": 1440,
        "profiler": false,
        "partition": "train",
        "cluster": null,
        "job_dir": "/data/home/xzhao9/benchmark/.userbenchmark/distributed/logs_torchdynamo",
        "model": "torchbenchmark.models.hf_Bert.Model",
        "trainer": "torchbenchmark.util.distributed.core_model.trainer.Trainer",
        "distributed": "ddp",
        "dist_url": "file:///data/home/xzhao9/benchmark/.userbenchmark/distributed/logs_torchdynamo/92b3d2fbd2984cf3aa620e790edf280d_init",
        "output_dir": "/data/home/xzhao9/benchmark/.userbenchmark/distributed/logs_torchdynamo"
    },
    "metrics": {
        "0-latency_median": 362.51340637207034,
        "0-latency_stdev": 2.6765847673834227,
        "1-latency_median": 362.52191162109375,
        "1-latency_stdev": 2.6585795546458573,
        "2-latency_median": 362.55426330566405,
        "2-latency_stdev": 2.641738105760016,
        "3-latency_median": 362.5112548828125,
        "3-latency_stdev": 2.6668644262567915,
        "4-latency_median": 362.547509765625,
        "4-latency_stdev": 2.6403463499953577,
        "5-latency_median": 362.5440246582031,
        "5-latency_stdev": 3.224301047544871,
        "6-latency_median": 362.5065460205078,
        "6-latency_stdev": 2.6760710391195124,
        "7-latency_median": 362.5503784179688,
        "7-latency_stdev": 2.6452712921359853
    }
}
```
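To compare the two runs, the per-rank `<rank>-latency_median` values can be averaged and divided. The snippet below is a minimal sketch, not part of this change: the helper name `mean_median_latency` is hypothetical, and the dicts hold only ranks 0 and 1 copied from the outputs above to keep the example short.

```python
from statistics import mean


def mean_median_latency(result: dict) -> float:
    """Average the per-rank `<rank>-latency_median` entries of a result dict.

    Hypothetical helper for illustration; it is not a TorchBench API.
    """
    medians = [
        value
        for key, value in result["metrics"].items()
        if key.endswith("-latency_median")
    ]
    return mean(medians)


# Ranks 0 and 1 only, copied from the eager output above.
eager = {
    "metrics": {
        "0-latency_median": 375.4787841796875,
        "0-latency_stdev": 0.5074592125167986,
        "1-latency_median": 375.49486389160154,
        "1-latency_stdev": 0.5890401056068594,
    }
}

# Ranks 0 and 1 only, copied from the TorchDynamo output above.
dynamo = {
    "metrics": {
        "0-latency_median": 362.51340637207034,
        "0-latency_stdev": 2.6765847673834227,
        "1-latency_median": 362.52191162109375,
        "1-latency_stdev": 2.6585795546458573,
    }
}

# Ratio of eager latency to TorchDynamo latency; values above 1 mean
# the TorchDynamo run had the lower median latency.
latency_ratio = mean_median_latency(eager) / mean_median_latency(dynamo)
print(f"eager / torchdynamo latency ratio: {latency_ratio:.4f}")
```

In practice the full JSON output of each run could be loaded with `json.load` and passed to the same helper, so that all eight ranks are included.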
Pull Request resolved: https://github.com/pytorch/benchmark/pull/1031
Reviewed By: FindHao
Differential Revision: D37888709
Pulled By: xuzhao9
fbshipit-source-id: fd145185c12a65eb41de8bd7ee34984b09c904e0