Add support for NVIDIA DCGM based FLOPs/sec calculation (#929)
Summary:
Add choices for `--flops`:
- `--flops model`: uses an estimation method to calculate the FLOPs.
- `--flops dcgm`: uses the NVIDIA DCGM API to collect hardware counters for FP32 computations and calculates the FLOPs from them.
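As a rough illustration of the arithmetic behind a FLOPs-per-second figure, the sketch below converts an operation count and an elapsed time into TFLOPs per second. The function name `tflops_per_second` and the sample numbers are hypothetical and are not taken from the benchmark code; how the operation count is obtained (model estimate or DCGM counters) is outside the scope of this sketch.

```python
def tflops_per_second(flop_count: float, elapsed_ms: float) -> float:
    """Convert an operation count and an elapsed time in milliseconds
    into TFLOPs per second. Hypothetical helper, for illustration only."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    # ops / ms * 1e3 gives ops per second; dividing by 1e12 gives tera-ops.
    return flop_count * 1e3 / elapsed_ms / 1e12


if __name__ == "__main__":
    # Made-up inputs: 2e10 floating-point operations in 10 ms.
    print(f"{tflops_per_second(2e10, 10.0):.4f} TFLOPs per second")
```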
## Dependency
[NVIDIA DCGM](https://developer.nvidia.com/dcgm) is required for this feature. Install it by following the [official installation guide](https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-user-guide/getting-started.html#installation).
The `numba` Python package is also required. Install it with `pip install numba`.
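Before running with `--flops dcgm`, a quick pre-flight check along the following lines may help confirm that both dependencies are visible. This is only a sketch: the helper name `check_dcgm_dependencies` is hypothetical, and looking for the `dcgmi` command-line tool on `PATH` is an assumed heuristic for "DCGM is installed", not necessarily how the benchmark itself locates DCGM.

```python
import importlib.util
import shutil


def check_dcgm_dependencies() -> dict:
    """Report whether numba is importable and whether the dcgmi CLI
    is on PATH. Hypothetical helper; the PATH lookup is a heuristic."""
    return {
        # find_spec returns None when the top-level package is missing.
        "numba": importlib.util.find_spec("numba") is not None,
        # shutil.which returns None when the executable is not found.
        "dcgmi": shutil.which("dcgmi") is not None,
    }


if __name__ == "__main__":
    for name, found in check_dcgm_dependencies().items():
        print(f"{name}: {'found' if found else 'missing'}")
```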
## Run
For example, the following command measures the FLOPs of resnet50:
```
python run.py -d cuda --flops dcgm resnet50
```
The end of the output should look like the following:
```
GPU Time: 12.097 milliseconds
CPU Total Wall Time: 12.137 milliseconds
FLOPS: 1.9684 TFLOPs per second
Correctness: Correct
```
Pull Request resolved: https://github.com/pytorch/benchmark/pull/929
Reviewed By: xuzhao9
Differential Revision: D36644440
Pulled By: FindHao
fbshipit-source-id: a927bf891ccc0b590af69e3cf5d062440ff371b6