Fix BERT_pytorch model on bf16. (#2185)
Summary:
The `get_module()` implementation of BERT_pytorch is buggy because it returns only part of the computation involved in `train()` and `eval()`. As a result, when running in `bf16` precision, only part of the model is converted to `bf16`, and that part does not work with the rest of the model that runs in `eval()`.
This fix makes `get_module()` return the entire model, which fixes runs with `bf16` precision in both eager and pt2 modes.
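A minimal sketch of the failure mode, assuming the harness applies the precision conversion to whatever `get_module()` returns. The classes `Encoder` and `FullModel` and the helper `param_dtypes` are hypothetical stand-ins and are not taken from the BERT_pytorch code:

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Hypothetical stand-in for the part of the model that was returned."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 8)

    def forward(self, x):
        return self.linear(x)


class FullModel(nn.Module):
    """Hypothetical stand-in for the entire model used by train() and eval()."""

    def __init__(self):
        super().__init__()
        self.encoder = Encoder()
        self.head = nn.Linear(8, 2)

    def forward(self, x):
        return self.head(self.encoder(x))


def param_dtypes(module):
    """Collect the set of parameter dtypes found in a module."""
    return {p.dtype for p in module.parameters()}


# Buggy pattern: only the returned submodule is converted, so the model
# ends up with mixed parameter dtypes.
buggy = FullModel()
buggy.encoder.to(torch.bfloat16)
buggy_dtypes = param_dtypes(buggy)

# Fixed pattern: the whole model is converted, so all parameters agree.
fixed = FullModel()
fixed.to(torch.bfloat16)
fixed_dtypes = param_dtypes(fixed)

# With a uniform dtype, a bf16 input flows through the full forward pass.
out = fixed(torch.randn(4, 8, dtype=torch.bfloat16))
```

Comparing `buggy_dtypes` with `fixed_dtypes` shows why returning the entire model matters: the conversion can only reach the modules it is handed.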
Pull Request resolved: https://github.com/pytorch/benchmark/pull/2185
Test Plan:
```
$ python run.py BERT_pytorch -d cuda --precision bf16 --torchdynamo inductor
Running eval method from BERT_pytorch on cuda in dynamo inductor mode with input batch size 32 and precision bf16.
GPU Time per batch: 14.625 milliseconds
CPU Wall Time per batch: 14.663 milliseconds
CPU Wall Time: 14.663 milliseconds
Time to first batch: 3477.1816 ms
GPU 0 Peak Memory: 3.8965 GB
CPU Peak Memory: 0.7637 GB
PT2 Compilation time: 33.701 seconds
```
Reviewed By: HDCharles
Differential Revision: D54621014
Pulled By: xuzhao9
fbshipit-source-id: abfeae48c92f0d4b437c8111e7f1e3a37e088876