Add more GPU metrics to DCGM monitor (#989)
Summary:
## New GPU metrics
Add support for the following GPU metrics:
- GPUDRAMActive: The ratio of cycles the device memory interface is active sending or receiving data.
- GPUPCIETX: The number of bytes of active PCIe tx (transmit) data, including both header and payload. This is treated as device memory write traffic.
- GPUPCIERX: The number of bytes of active PCIe rx (read) data, including both header and payload. This is treated as device memory read traffic.
## Export all records to csv file ordered by timestamp
Add a new argument `--export-dcgm-metrics` that exports all GPU FP32 unit active ratio, memory traffic, and memory throughput records to a csv file. The default csv file name is [model_name]_all_metrics.csv.
The final csv file looks like the following.
timestamp(ms) | gpu_fp32active(%) | gpu_pcierx(bytes) | gpu_pcietx(bytes) | duration(ms) | read_throughput(GB/s) | write_throughput(GB/s)
-- | -- | -- | -- | -- | -- | --
0 | 0 | 17241379 | 155172413 | 0 | |
0.23 | 0 | 3164139 | 9492419 | 0.23 | 12.81 | 38.44
2.6 | 0 | 1131301 | 2036343 | 2.37 | 0.44 | 0.8
3.82 | 0 | 4206098 | 4588471 | 1.22 | 3.22 | 3.51
- `timestamp(ms)` is the timestamp of a record.
- `gpu_fp32active(%)` is the ratio of FP32 unit active cycles during this record.
- `gpu_pcierx(bytes)` is the number of bytes read from device memory.
- `gpu_pcietx(bytes)` is the number of bytes written to device memory.
- `duration(ms)` is the length of the interval this record covers.
- `read_throughput(GB/s)` is derived as `gpu_pcierx(bytes)` / `duration(ms)` * 1000 / 1024 / 1024 / 1024.
- `write_throughput(GB/s)` is derived as `gpu_pcietx(bytes)` / `duration(ms)` * 1000 / 1024 / 1024 / 1024.
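The throughput derivation above can be sketched in a few lines of Python. This is only an illustration of the formula, not the code added by this PR; the function name `throughput_gb_per_s` is hypothetical, and returning `None` for a zero-length record is an assumption based on the empty throughput cells in the first row of the example table.

```python
# Illustrative sketch of the throughput formula; not the code from this PR.
BYTES_PER_GB = 1024 ** 3  # the formula divides by 1024 three times


def throughput_gb_per_s(num_bytes, duration_ms):
    """Convert a byte count over a duration in ms to GB/s.

    Returns None when the duration is zero (assumed behaviour, matching
    the empty throughput cells of the first record in the example).
    """
    if duration_ms <= 0:
        return None
    # bytes / ms * 1000 gives bytes per second; then scale to GB.
    return num_bytes / duration_ms * 1000 / BYTES_PER_GB


# Second record of the example table: 0.23 ms duration.
read_gbps = throughput_gb_per_s(3164139, 0.23)
write_gbps = throughput_gb_per_s(9492419, 0.23)
print(round(read_gbps, 2), round(write_gbps, 2))
```

Applied to the second row of the example, this reproduces the 12.81 and 38.44 values in the table.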
A line chart can easily be generated by opening this file with Google Sheets or Excel.
Pull Request resolved: https://github.com/pytorch/benchmark/pull/989
Reviewed By: xuzhao9
Differential Revision: D37434446
Pulled By: FindHao
fbshipit-source-id: 4dfc2b964f5bae2a4c18fa8c2e8bae2db3d6a049