Add fp16 amp mode support for all models (#776)
Summary:
This PR adds fp16 amp support to all models. Basically, it wraps every eval test in an autocast context:
```
with torch.cuda.amp.autocast():
    eval()
```
I have the following concerns regarding the current amp mode:
1. Some models don't support it. Examples: BERT_pytorch, attention_is_all_you_need_pytorch
Reproduction:
```
$ python run.py BERT_pytorch -d cuda --fp16 amp
Running eval method from BERT_pytorch on cuda in eager mode.
File "/fsx/users/xzhao9/benchmark/torchbenchmark/models/BERT_pytorch/bert_pytorch/model/attention/single.py", line 19, in forward
scores = scores.masked_fill(mask == 0, -1e9)
RuntimeError: value cannot be converted to type at::Half without overflow
```
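The overflow happens because the hard-coded mask value `-1e9` lies outside the representable range of fp16, whose largest finite magnitude is 65504. A minimal demonstration with NumPy (not part of this PR):

```python
import numpy as np

# fp16 can only represent magnitudes up to 65504
print(np.finfo(np.float16).max)  # 65504.0

# the hard-coded -1e9 mask value overflows in half precision
with np.errstate(over="ignore"):
    print(np.float16(-1e9))  # -inf
```

A common workaround seen in other codebases (not applied here) is to mask with `torch.finfo(scores.dtype).min` instead of a hard-coded constant, so the fill value adapts to the active dtype.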
2. Some models don't return correct results. Examples: dlrm, moco, pyhpc_turbulent_kinetic_energy
Reproduction:
```
$ python run.py pyhpc_turbulent_kinetic_energy -d cuda --fp16 amp
Running eval method from pyhpc_turbulent_kinetic_energy on cuda in eager mode.
GPU Time: 7.316 milliseconds
CPU Dispatch Time: 7.251 milliseconds
CPU Total Wall Time: 7.350 milliseconds
Correctness: 0.000000000000000
```
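The `Correctness` line compares the amp output against an fp32 reference, so a score of 0.0 means the two results diverge completely. A sketch of one such similarity check, assuming a cosine-similarity metric (the benchmark's actual metric may differ):

```python
import numpy as np

def correctness(ref, out):
    """Cosine similarity between flattened reference and test outputs."""
    ref = np.ravel(np.asarray(ref, dtype=np.float64))
    out = np.ravel(np.asarray(out, dtype=np.float64))
    return float(np.dot(ref, out) / (np.linalg.norm(ref) * np.linalg.norm(out)))

fp32_result = np.array([0.5, -1.25, 3.0])
print(correctness(fp32_result, fp32_result))  # ~1.0: identical outputs
print(correctness(fp32_result, fp32_result.astype(np.float16)))  # ~1.0: small rounding only
```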
3. About 2/3 of the models regress slightly in performance in amp mode. Examples: squeezenet1_1, alexnet
Reproduction:
```
$ python run.py alexnet -d cuda --fp16 amp
Running eval method from alexnet on cuda in eager mode.
GPU Time: 1.475 milliseconds
CPU Dispatch Time: 1.305 milliseconds
CPU Total Wall Time: 1.509 milliseconds
Correctness: 0.999999880790710
```
```
$ python run.py alexnet -d cuda --fp16 no
Running eval method from alexnet on cuda in eager mode.
GPU Time: 1.095 milliseconds
CPU Dispatch Time: 0.994 milliseconds
CPU Total Wall Time: 1.126 milliseconds
```
With amp, alexnet runs at only 0.74x the speed of fp32 mode.
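The 0.74x figure appears to come from the ratio of the GPU times above:

```python
# GPU times from the two alexnet runs above
fp32_gpu_ms = 1.095  # --fp16 no
amp_gpu_ms = 1.475   # --fp16 amp

relative_speed = fp32_gpu_ms / amp_gpu_ms
print(f"{relative_speed:.2f}x")  # 0.74x
```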
Pull Request resolved: https://github.com/pytorch/benchmark/pull/776
Reviewed By: ejguan
Differential Revision: D34559508
Pulled By: xuzhao9
fbshipit-source-id: cf585aac5e5eaedbcdca9e8292420a8beae82481