onnxruntime
1ebc5d38 - Log ORTModule initialization overhead (#16529)

### Log ORTModule initialization overhead

When profiling a model, for example:

```
torchrun --nproc_per_node=1 examples/onnxruntime/training/language-modeling/run_mlm.py \
  --model_name_or_path microsoft/deberta-v3-large \
  --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
  --num_train_epochs 10 \
  --per_device_train_batch_size 1 --per_device_eval_batch_size 1 \
  --do_train --overwrite_output_dir --output_dir ./outputs/ \
  --seed 1137 --fp16 --report_to none --optim adamw_ort_fused \
  --max_steps 200 --logging_steps 1 --use_module_with_loss
```

we get:

```
{'train_runtime': 303.8711, 'train_samples_per_second': 0.658, 'train_steps_per_second': 0.658, 'train_loss': 6.569518616199494, 'epoch': 0.09}
100%|200/200 [05:03<00:00, 1.52s/it]
***** train metrics *****
  epoch                    =    0.09
  train_loss               =  6.5695
  train_runtime            = 0:05:03.87
  train_samples            =    2223
  train_samples_per_second =   0.658
  train_steps_per_second   =   0.658
```

The end-to-end time is 303s (`train_runtime` = 0:05:03.87), but the ORTModule first-step initialization (including export, graph building, etc.) takes about 255s. So when we compare the end-to-end time of a baseline ORT build against an improved one, no perf gain shows up: an x% gain over the (303 - 255)s of real compute is diluted across the overall 303s. This is misleading! This PR therefore logs the ORTModule initialization overhead in the output, so we can manually subtract it, compute the real compute time, and measure the true perf gain.

If the log level is >= WARNING, only the total end-to-end time plus export time is logged; otherwise, a more detailed breakdown is logged:

![image](https://github.com/microsoft/onnxruntime/assets/10530022/8e34283d-4868-4f22-b65b-9f00d10d8fb7)
![image](https://github.com/microsoft/onnxruntime/assets/10530022/c13bcfad-0d79-483d-a886-e238efcbe657)
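As an illustrative sketch (not part of the PR), the dilution effect described above can be quantified with the numbers from the log: only the compute portion (end-to-end minus initialization) is accelerated, so even a sizeable kernel speedup barely moves the end-to-end figure. The function name and the 1.2x speedup are hypothetical choices for illustration.

```python
def diluted_gain(end_to_end_s: float, init_overhead_s: float,
                 compute_speedup: float) -> float:
    """Apparent end-to-end speedup when only the compute portion
    (end_to_end - init) is accelerated by `compute_speedup`."""
    compute_s = end_to_end_s - init_overhead_s
    improved_total_s = init_overhead_s + compute_s / compute_speedup
    return end_to_end_s / improved_total_s

baseline_total_s = 303.0  # train_runtime from the log above (~0:05:03)
init_overhead_s = 255.0   # ORTModule first-step initialization

# A 20% faster compute path (1.2x) shows up as only ~1.03x end to end,
# which is why subtracting the logged initialization overhead matters.
print(f"{diluted_gain(baseline_total_s, init_overhead_s, 1.2):.3f}x")
```

Once the initialization overhead is logged, the same arithmetic run on (end-to-end minus overhead) recovers the real compute-time gain.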