Log ORTModule initialization overhead (#16529)
### Log ORTModule initialization overhead
When profiling a model, for example:
```
torchrun --nproc_per_node=1 examples/onnxruntime/training/language-modeling/run_mlm.py --model_name_or_path microsoft/deberta-v3-large --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --num_train_epochs 10 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --do_train --overwrite_output_dir --output_dir ./outputs/ --seed 1137 --fp16 --report_to none --optim adamw_ort_fused --max_steps 200 --logging_steps 1 --use_module_with_loss
{'train_runtime': 303.8711, 'train_samples_per_second': 0.658, 'train_steps_per_second': 0.658, 'train_loss': 6.569518616199494, 'epoch': 0.09}
100%|200/200 [05:03<00:00, 1.52s/it]
***** train metrics *****
epoch = 0.09
train_loss = 6.5695
train_runtime = 0:05:03.87
train_samples = 2223
train_samples_per_second = 0.658
train_steps_per_second = 0.658
```
The end to end time is 303s (train_runtime=0:05:03.87), but the
ORTModule first-step initialization (including export, graph building, etc.)
takes about 255s. So when we compare the end to end time of a baseline
ORT run against an improved version of ORT, there is no visible perf gain:
an x% gain over the remaining (303-255)s of compute is diluted across the
overall 303s. This is misleading!
This PR therefore logs the ORTModule initialization overhead in the output,
so we can manually subtract it to get the real compute time and the actual
perf gain.
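
As a minimal sketch of that manual computation (using the ~303.9s runtime and ~255s overhead from the run shown above, plus an assumed 294.3s runtime for a hypothetical improved build; the helper below is illustrative only and not part of this PR):

```python
def real_compute_time(train_runtime_s: float, ortmodule_init_s: float) -> float:
    """Training time with the one-time ORTModule initialization removed."""
    return train_runtime_s - ortmodule_init_s

# 303.9s / 255s come from the run above; 294.3s is an assumed improved run.
baseline_total, improved_total, init_overhead = 303.9, 294.3, 255.0

baseline_compute = real_compute_time(baseline_total, init_overhead)  # ~48.9s
improved_compute = real_compute_time(improved_total, init_overhead)  # ~39.3s

# A ~20% gain on the real compute time shows up as only ~3% end to end.
print(f"end-to-end gain   : {(1 - improved_total / baseline_total) * 100:.1f}%")
print(f"real compute gain : {(1 - improved_compute / baseline_compute) * 100:.1f}%")
```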
If the log level is >= WARNING, only the total end to end time plus the
export time is logged; otherwise, a more detailed breakdown is logged.
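
A minimal sketch of how to opt into the detailed breakdown, assuming the usual ORTModule `DebugOptions`/`LogLevel` API and a placeholder model:

```python
import torch
from onnxruntime.training.ortmodule import ORTModule, DebugOptions, LogLevel

pt_model = torch.nn.Linear(10, 10)  # stand-in for the real PyTorch model

# With the default WARNING level, only the total end-to-end initialization
# time plus the export time is logged. A level below WARNING (INFO or
# VERBOSE) also logs the per-phase breakdown of the overhead.
ort_model = ORTModule(pt_model, DebugOptions(log_level=LogLevel.INFO))
```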

