Add single-process DDP accuracy support to dynamo benchmark suite (#88511)
- Multi-process DDP is intentionally out of scope: it is more complex,
and the torchbench scripts already cover it
- Currently works only in accuracy mode, since that was the main goal,
but could be extended to measure the single-GPU perf impact of
graph breaks
Run with:
`python benchmarks/dynamo/torchbench.py --inductor --training --accuracy --only hf_Bert --ddp`
Example output:
```
cuda train hf_Bert
[2022-11-04 18:52:08,304] torch._inductor.compile_fx: [WARNING] skipping cudagraphs due to complex input striding
PASS
```
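
When scripting this check (for example in CI), the output can be reduced to a pass/fail signal. The snippet below is a minimal sketch, not part of this PR: `parse_accuracy_output` and `LOG_LINE` are hypothetical names, and it assumes the output always has the shape shown above (a `device mode model` header, optional logger lines starting with a bracketed timestamp, and a final status line).

```python
import re

# Matches logger lines such as
# "[2022-11-04 18:52:08,304] torch._inductor.compile_fx: [WARNING] ..."
# (assumption: every log line starts with a bracketed timestamp).
LOG_LINE = re.compile(r"^\[\d{4}-\d{2}-\d{2} [\d:,]+\] ")


def parse_accuracy_output(text):
    """Return (device, mode, model, status) for one benchmark run."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Drop logger lines so only the header and the status line remain.
    lines = [ln for ln in lines if not LOG_LINE.match(ln)]
    if len(lines) < 2:
        raise ValueError("expected a header line and a status line")
    device, mode, model = lines[0].split()
    status = lines[-1]
    return device, mode, model, status


sample = """\
cuda train hf_Bert
[2022-11-04 18:52:08,304] torch._inductor.compile_fx: [WARNING] skipping cudagraphs due to complex input striding
PASS
"""

result = parse_accuracy_output(sample)
print(result)
```

The helper treats the last non-log line as the status and leaves its interpretation to the caller, since the full set of status strings the suite can print is not listed here.
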
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88511
Approved by: https://github.com/davidberard98, https://github.com/aazzolini