tiny improvement to the cprofile wrapper (#120100)
Summary:
1. Right now we double-increment the profile counter. This PR avoids that, so we no longer end up with profile_0, profile_2, profile_4, ...
2. Log the latency of running the passed-in function with profiling on, so we can easily skip those _compile calls that return quickly.
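
For illustration, a minimal sketch of the two points above. This is not the actual PyTorch code: the names cprofile_wrapper, _profile_counter, and output_dir, the file naming scheme, and the log message are all assumptions made for this example. The counter is advanced exactly once per call, and the wall-clock latency of the profiled call is logged next to the profile path.

```python
import cProfile
import itertools
import logging
import os
import tempfile
import time

log = logging.getLogger(__name__)

# Hypothetical module-level counter. Calling next() exactly once per
# profiled call yields profile_0, profile_1, profile_2, ... with no gaps.
_profile_counter = itertools.count()


def cprofile_wrapper(func, output_dir=None):
    """Return a wrapper that profiles each call to func with cProfile."""
    output_dir = output_dir or tempfile.gettempdir()

    def wrapped(*args, **kwargs):
        # Single increment per call (assumed naming scheme).
        index = next(_profile_counter)
        profile_path = os.path.join(
            output_dir, f"{func.__name__}_profile_{index}.profile"
        )

        prof = cProfile.Profile()
        start = time.perf_counter()
        # runcall enables profiling, calls func, then disables profiling.
        result = prof.runcall(func, *args, **kwargs)
        elapsed = time.perf_counter() - start

        prof.dump_stats(profile_path)
        # The latency makes it easy to spot calls that returned quickly
        # and whose profiles are probably not worth inspecting.
        log.warning(
            "Profiled %s in %.3f s, stats written to %s",
            func.__name__, elapsed, profile_path,
        )
        return result

    return wrapped
```

With this shape, wrapping a function and calling it three times writes three consecutively numbered profile files, each accompanied by one log line with the measured latency.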
X-link: https://github.com/pytorch/pytorch/pull/120100
Approved by: https://github.com/eellison
Reviewed By: huydhn
Differential Revision: D53930648
Pulled By: shunting314
fbshipit-source-id: e7af70f52c655453c5d7b3d7c82aa3e17f69b1df