Add TFLOPS measurement for resnet50 model (#668)
Summary:
Eager result with batch size 128 on V100:
```
$ python run.py resnet50 -t eval -d cuda --flops --bs 128
Running eval method from resnet50 on cuda in eager mode.
Unsupported operator aten::max_pool2d encountered 1 time(s)
Unsupported operator aten::add_ encountered 16 time(s)
GPU Time: 64.208 milliseconds
CPU Dispatch Time: 13.305 milliseconds
CPU Total Wall Time: 64.216 milliseconds
FLOPS: 16.5236 TFLOPs per second
```
Fx2trt result with batch size 128 on V100:
```
GPU Time: 17.779 milliseconds
CPU Dispatch Time: 0.465 milliseconds
CPU Total Wall Time: 17.778 milliseconds
FLOPS: 59.6853 TFLOPs per second
```
The hardware roofline for V100 is 120 TFLOPs per second.
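
As a rough sanity check on the numbers above, the sketch below back-computes the per-batch FLOP count implied by each run and compares the reported throughput against the 120 TFLOPs roofline. The helper names (`implied_flops`, `roofline_utilization`) are hypothetical and not part of `run.py`; the only assumption is that the reported figure is total FLOPs divided by GPU time.
```python
# Hypothetical helpers for checking the reported figures; not part of run.py.
# Assumption: reported TFLOPs/s = total FLOPs per batch / GPU time.

V100_ROOFLINE_TFLOPS = 120.0  # roofline quoted in this summary


def implied_flops(tflops_per_sec: float, gpu_time_ms: float) -> float:
    """Total FLOPs per batch implied by a throughput and a latency."""
    return tflops_per_sec * 1e12 * (gpu_time_ms / 1e3)


def roofline_utilization(tflops_per_sec: float,
                         roofline: float = V100_ROOFLINE_TFLOPS) -> float:
    """Fraction of the hardware roofline that was achieved."""
    return tflops_per_sec / roofline


# Figures copied from the eager and fx2trt logs above (batch size 128).
eager_tflops, eager_ms = 16.5236, 64.208
trt_tflops, trt_ms = 59.6853, 17.779

eager_flops = implied_flops(eager_tflops, eager_ms)
trt_flops = implied_flops(trt_tflops, trt_ms)

# Both runs execute the same model, so the implied FLOP counts should agree.
relative_gap = abs(eager_flops - trt_flops) / eager_flops

speedup = eager_ms / trt_ms
eager_util = roofline_utilization(eager_tflops)
trt_util = roofline_utilization(trt_tflops)

print(f"implied FLOPs per batch: eager={eager_flops:.4e}, fx2trt={trt_flops:.4e}")
print(f"relative gap: {relative_gap:.2e}, speedup: {speedup:.2f}x")
print(f"roofline utilization: eager={eager_util:.1%}, fx2trt={trt_util:.1%}")
```
Both runs imply about 1.06e12 FLOPs per batch, which is consistent with the same FLOP count being divided by each GPU time. Against the 120 TFLOPs roofline, eager mode sits at roughly 14% and fx2trt at roughly 50%.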
Pull Request resolved: https://github.com/pytorch/benchmark/pull/668
Reviewed By: yinghai
Differential Revision: D33280207
Pulled By: xuzhao9
fbshipit-source-id: e84f634b0ced5b8603d24e0dbb56a0d935bad907