Add the hf_bert E2E benchmark (#771)
Summary:
This PR adds hf_bert, the first end-to-end workload in the suite. The workload:
- Supports both training and inference
- Uses `amp.autocast()` by default for fp16 training/inference
- Currently reports latency and QPS as performance metrics
- Does not support multi-GPU runs yet (planned for the future)
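As a rough illustration of the `amp.autocast()` pattern mentioned above, here is a minimal sketch (not the benchmark's actual code). It is shown on CPU with bfloat16 so it runs without a GPU; the workload itself targets CUDA fp16.

```python
import torch

# Hedged sketch of the autocast pattern: ops inside the context run in a
# lower-precision dtype where eligible. On CUDA the benchmark would use
# fp16; CPU autocast uses bfloat16.
model = torch.nn.Linear(8, 2)
x = torch.randn(4, 8)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)
print(out.dtype)  # torch.bfloat16
```

For the real workload, substitute the hf_bert model and `device_type="cuda"` with `dtype=torch.float16`.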
To run the benchmark, use: `python run_e2e.py hf_bert -t [train|eval] --fp16 [no|amp]`. For example, on A100:
```
$ python run_e2e.py hf_bert -t eval
{"device": "cuda", "device_num": 1, "test": "eval", "num_examples": 1043, "batch_size": 1, "result": {"latency": 14.56970322, "qps": 71.58690772563314}}
$ python run_e2e.py hf_bert -t train
{"device": "cuda", "device_num": 1, "test": "train", "num_examples": 8576, "batch_size": 32, "result": {"latency": 36.95959081, "qps": 232.03720095514768}}
```
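Judging from the sample output above, QPS appears to be derived as the number of examples divided by the total wall-clock latency in seconds (an inference from the reported numbers, not a quote of the benchmark's code):

```python
# qps = num_examples / latency; field names mirror the JSON output above.
def qps(num_examples: int, latency_s: float) -> float:
    return num_examples / latency_s

# Reproduce the eval figure from the sample run:
print(qps(1043, 14.56970322))  # ~71.5869
```

The train figure follows the same relation: 8576 examples over 36.96 s gives roughly 232 QPS.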
Pull Request resolved: https://github.com/pytorch/benchmark/pull/771
Reviewed By: erichan1
Differential Revision: D34529471
Pulled By: xuzhao9
fbshipit-source-id: a9f8b43c9e4e4ff30dfd76c1c88fe3948976fbd2