benchmark
f30d537c - Add userbenchmark metrics to the distributed benchmark (#954)

3 years ago
Add userbenchmark metrics to the distributed benchmark (#954)

Summary:
Test to run on AWS Cluster (8xA100 GPU):

```
python run_benchmark.py distributed --ngpus 8 --partition train --job_dir $PWD/.userbenchmark/distributed/logs
```

Output metrics json file:

```
{
  "name": "distributed",
  "environ": {
    "pytorch_git_version": "367ce697da444978ab49ed7426c1ffee57d1e88b"
  },
  "args": {
    "ngpus": 8,
    "nodes": 1,
    "timeout": 1440,
    "profiler": false,
    "partition": "train",
    "job_dir": "/data/home/xzhao9/benchmark/.userbenchmark/distributed/logs",
    "model": "torchbenchmark.e2e_models.hf_bert.Model",
    "trainer": "torchbenchmark.util.distributed.ddp.DDPTrainer",
    "dist_url": "file:///data/home/xzhao9/benchmark/.userbenchmark/distributed/logs/300741a6a31248b5af287dad433b9200_init",
    "output_dir": "/data/home/xzhao9/benchmark/.userbenchmark/distributed/logs"
  },
  "metrics": {
    "0-fwd_mean": 18.587260818481447,
    "0-fwd_stdev": 1.249975396652668,
    "0-bwd_mean": 22.979004859924316,
    "0-bwd_stdev": 0.17279135941652868,
    "0-opt_mean": 17.845138931274413,
    "0-opt_stdev": 0.10701804960748133,
    "1-fwd_mean": 14.643673610687255,
    "1-fwd_stdev": 0.12621190012201905,
    "1-bwd_mean": 31.684182357788085,
    "1-bwd_stdev": 1.6110578055095222,
    "1-opt_mean": 12.702611064910888,
    "1-opt_stdev": 0.11127842204111327,
    "2-fwd_mean": 13.7170880317688,
    "2-fwd_stdev": 0.3072370404256818,
    "2-bwd_mean": 33.75406379699707,
    "2-bwd_stdev": 1.2665815412703216,
    "2-opt_mean": 11.773264026641845,
    "2-opt_stdev": 0.08454682507453214,
    "3-fwd_mean": 13.955654430389405,
    "3-fwd_stdev": 0.28808249829358984,
    "3-bwd_mean": 33.470188522338866,
    "3-bwd_stdev": 1.2093597218183398,
    "3-opt_mean": 11.838454341888427,
    "3-opt_stdev": 0.13043112330328627,
    "4-fwd_mean": 14.69996166229248,
    "4-fwd_stdev": 2.096843783502309,
    "4-bwd_mean": 33.06722240447998,
    "4-bwd_stdev": 2.384150208768099,
    "4-opt_mean": 11.545113658905029,
    "4-opt_stdev": 0.17722284388649193,
    "5-fwd_mean": 14.81980791091919,
    "5-fwd_stdev": 0.1175716373064526,
    "5-bwd_mean": 32.68328037261963,
    "5-bwd_stdev": 1.3603798665482199,
    "5-opt_mean": 11.662390422821044,
    "5-opt_stdev": 0.06255000873065251,
    "6-fwd_mean": 14.148975849151611,
    "6-fwd_stdev": 0.10553216426892825,
    "6-bwd_mean": 33.048796844482425,
    "6-bwd_stdev": 1.3651520012370102,
    "6-opt_mean": 12.033942317962646,
    "6-opt_stdev": 0.039871346805974116,
    "7-fwd_mean": 14.514809608459473,
    "7-fwd_stdev": 0.7432848123236139,
    "7-bwd_mean": 31.55668125152588,
    "7-bwd_stdev": 1.5674374937909337,
    "7-opt_mean": 13.237116622924805,
    "7-opt_stdev": 1.2275919701313986
  }
}
```

This metrics output can be used to update the internal performance metrics dashboard to track performance.

Pull Request resolved: https://github.com/pytorch/benchmark/pull/954

Reviewed By: mrshenli

Differential Revision: D37078673

Pulled By: xuzhao9

fbshipit-source-id: 694c9939b1caaaa567629fedd63d08e3016970e8
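For anyone consuming the metrics JSON from the summary, the flat keys appear to follow a `<rank>-<phase>_<stat>` pattern (for example `3-bwd_mean`). That reading is an assumption drawn from the sample output, not something the commit states, and the units are not given. Under that assumption, a minimal sketch for regrouping the metrics by rank and comparing ranks could look like the following. The helper names (`group_by_rank`, `phase_means_across_ranks`, `slowest_rank`) are hypothetical and not part of the benchmark code.

```python
from collections import defaultdict
from statistics import mean


def group_by_rank(metrics):
    """Regroup flat keys such as "3-bwd_mean" into {rank: {phase: {stat: value}}}.

    Assumes every key has the form "<rank>-<phase>_<stat>".
    """
    grouped = defaultdict(lambda: defaultdict(dict))
    for key, value in metrics.items():
        rank, _, rest = key.partition("-")    # "3", "bwd_mean"
        phase, _, stat = rest.rpartition("_")  # "bwd", "mean"
        grouped[int(rank)][phase][stat] = value
    return grouped


def phase_means_across_ranks(metrics):
    """Average each phase's per-rank "mean" value over all ranks."""
    grouped = group_by_rank(metrics)
    phases = sorted({p for per_rank in grouped.values() for p in per_rank})
    return {
        p: mean(grouped[r][p]["mean"] for r in grouped if p in grouped[r])
        for p in phases
    }


def slowest_rank(metrics, phase):
    """Return the rank with the largest "mean" value for the given phase."""
    grouped = group_by_rank(metrics)
    return max(grouped, key=lambda r: grouped[r][phase]["mean"])


# A two-rank excerpt of the "metrics" object from the commit summary.
sample = {
    "0-fwd_mean": 18.587260818481447,
    "0-bwd_mean": 22.979004859924316,
    "1-fwd_mean": 14.643673610687255,
    "1-bwd_mean": 31.684182357788085,
}

if __name__ == "__main__":
    print(phase_means_across_ranks(sample))
    print("slowest fwd rank:", slowest_rank(sample, "fwd"))
```

To run this against a real output file, load it with `json.load` and pass its `"metrics"` object to these helpers. The location of that file depends on the `--job_dir` used for the run.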