benchmark
7f1f4d2b - Add more GPU metrics to DCGM monitor (#989)

Commit

3 years ago

Add more GPU metrics to DCGM monitor (#989)

Summary:

## New GPU metrics

Add support for the following GPU metrics:

- GPUDRAMActive: The ratio of cycles the device memory interface is active sending or receiving data.
- GPUPCIETX: The number of bytes of active PCIe tx (transmit) data including both header and payload. It is supposed to be device memory write traffic.
- GPUPCIERX: The number of bytes of active PCIe rx (read) data including both header and payload. It is supposed to be device memory read traffic.

## Export all records to csv file ordered by timestamp

Add a new argument `--export-dcgm-metrics` to export all GPU FP32 unit active ratio, memory traffic, and memory throughput records to a csv file. The default csv file name is [model_name]_all_metrics.csv. The final csv file could look like the following.

timestamp(ms) | gpu_fp32active(%) | gpu_pcierx(bytes) | gpu_pcietx(bytes) | duration(ms) | read_throughput(GB/s) | write_throughput(GB/s)
-- | -- | -- | -- | -- | -- | --
0 | 0 | 17241379 | 155172413 | 0 | |
0.23 | 0 | 3164139 | 9492419 | 0.23 | 12.81 | 38.44
2.6 | 0 | 1131301 | 2036343 | 2.37 | 0.44 | 0.8
3.82 | 0 | 4206098 | 4588471 | 1.22 | 3.22 | 3.51

- `timestamp(ms)` is the timestamp of a record.
- `gpu_fp32active(%)` is the ratio of FP32 unit active cycles during this record.
- `gpu_pcierx(bytes)` is the number of bytes read from device memory.
- `gpu_pcietx(bytes)` is the number of bytes written to device memory.
- `duration(ms)` is how long this record was monitored.
- `read_throughput(GB/s)` is computed as `gpu_pcierx(bytes)` / `duration(ms)` * 1000 / 1024 / 1024 / 1024.
- `write_throughput(GB/s)` is computed as `gpu_pcietx(bytes)` / `duration(ms)` * 1000 / 1024 / 1024 / 1024.

A line chart can be generated easily by opening this file with Google Sheets or Excel.

Pull Request resolved: https://github.com/pytorch/benchmark/pull/989

Reviewed By: xuzhao9

Differential Revision: D37434446

Pulled By: FindHao

fbshipit-source-id: 4dfc2b964f5bae2a4c18fa8c2e8bae2db3d6a049
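The throughput columns described in the commit message can be reproduced from the byte counts and the record duration. The following is a minimal sketch of that arithmetic, not the code from this commit: the function name `throughput_gb_per_s` and the choice to return `None` for a zero-length record (matching the empty throughput cells in the first sample row) are assumptions made for illustration.

```python
from typing import Optional

# Divisor used in the commit message: 1024 * 1024 * 1024 bytes per "GB".
BYTES_PER_GB = 1024 ** 3


def throughput_gb_per_s(num_bytes: int, duration_ms: float) -> Optional[float]:
    """Convert a byte count observed over duration_ms into GB/s.

    Hypothetical helper: returns None when the duration is not positive,
    since no rate can be derived for such a record.
    """
    if duration_ms <= 0:
        return None
    # bytes / ms * 1000 gives bytes per second; dividing by 1024^3 gives GB/s.
    return num_bytes / duration_ms * 1000 / BYTES_PER_GB


# Second sample row from the commit message: 3164139 bytes read and
# 9492419 bytes written during a 0.23 ms record.
read_gbps = throughput_gb_per_s(3164139, 0.23)
write_gbps = throughput_gb_per_s(9492419, 0.23)
```

Because the sample table shows durations rounded to two decimals, recomputing a row from the displayed values may differ from the displayed throughput in the last digit.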

Author

FindHao

Committer

facebook-github-bot

Parents

023c5972
