Tiny improvements to the cProfile wrapper (#120100)
1. Right now we increment the profile counter twice per call. This PR avoids that, so we no longer end up with profile_0, profile_2, profile_4, ...
2. Log the latency of running the passed-in function with profiling on, so we can easily skip the profiles of _compile calls that return quickly.
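
For illustration, the two changes can be sketched roughly as below. This is a minimal standalone sketch, not the actual PyTorch code: the names `cprofile_wrapper`, `_profile_counter`, and `out_dir`, as well as the `.prof` file naming, are assumptions made for the example.

```python
import cProfile
import functools
import itertools
import logging
import os
import tempfile
import time

log = logging.getLogger(__name__)

# Hypothetical module-level counter. next() is the only place it advances,
# so each profiled call consumes exactly one id.
_profile_counter = itertools.count()


def cprofile_wrapper(func, out_dir=None):
    """Wrap func so every call is profiled and dumped to its own file."""
    out_dir = out_dir or tempfile.gettempdir()

    @functools.wraps(func)
    def wrapped(*args, **kwargs):
        # Read the counter once and reuse the value; reading it a second
        # time would skip ids (profile_0, profile_2, ...).
        profile_id = next(_profile_counter)
        profile_path = os.path.join(out_dir, f"profile_{profile_id}.prof")

        prof = cProfile.Profile()
        start = time.perf_counter()
        try:
            return prof.runcall(func, *args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            prof.dump_stats(profile_path)
            # Logging the latency makes it easy to ignore profiles of
            # calls that returned almost immediately.
            log.info(
                "Profiled call took %.3f s with profiling on; stats in %s",
                elapsed,
                profile_path,
            )

    return wrapped
```

In this sketch, consecutive calls to the wrapped function write profile_0.prof, profile_1.prof, and so on, and the logged latency shows which of those files are worth opening.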
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120100
Approved by: https://github.com/eellison