[Profiler] Prefer TSC to wall clock when available (#73855)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73855
Calling the clock is one of the most expensive parts of profiling. We can reduce the profiling overhead by using `rdtsc` instead. The tradeoff is that TSC values are raw counter ticks, so we have to measure the counter against the wall clock and convert to wall time (shift and scale).
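For illustration, a minimal sketch of what a shift-and-scale conversion can look like. The names `TscSample` and `TscConverter` are hypothetical and are not the profiler's actual API. The sketch assumes two calibration samples, each pairing a raw counter value with a wall-clock reading taken at the same moment, and assumes the counter ticks at a constant rate between them.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical calibration sample: a raw counter value (e.g. from rdtsc)
// paired with the wall-clock time observed at the same moment.
struct TscSample {
  uint64_t tsc;
  int64_t wall_ns;
};

// Hypothetical converter. Assumes end.tsc > start.tsc and a constant
// tick rate between the two samples.
class TscConverter {
 public:
  TscConverter(TscSample start, TscSample end)
      : tsc_origin_(start.tsc),
        wall_origin_ns_(start.wall_ns),
        // Scale: nanoseconds of wall time per counter tick.
        ns_per_tick_(static_cast<double>(end.wall_ns - start.wall_ns) /
                     static_cast<double>(end.tsc - start.tsc)) {}

  int64_t to_wall_ns(uint64_t tsc) const {
    // Shift: ticks relative to the calibration origin. The cast to a
    // signed type keeps values recorded before the origin negative.
    const auto delta_ticks = static_cast<int64_t>(tsc - tsc_origin_);
    // Scale the tick delta to nanoseconds, then shift back to wall time.
    return wall_origin_ns_ +
        static_cast<int64_t>(
               std::llround(static_cast<double>(delta_ticks) * ns_per_tick_));
  }

 private:
  uint64_t tsc_origin_;
  int64_t wall_origin_ns_;
  double ns_per_tick_;
};
```

With this shape the hot path only records the raw counter value; the conversion can be deferred until the events are post-processed.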
Test Plan: I added a C++ unit test with *very* aggressive anti-flake measures. I also ran the overhead benchmark (9 replicates) with `--stressTestKineto` (0.94 -> 0.89 us) and `--stressTestKineto --kinetoProfileMemory` (1.27 -> 1.17 us).
Reviewed By: chaekit
Differential Revision: D34231071
fbshipit-source-id: e3b3dd7580d93bcc783e87c7f2fc726cb74f4df8
(cherry picked from commit e8be9f8160793c6ee35d5af02bca3e01703e377d)