[Profiler] Limit calls to `recordThreadInfo` (#74888)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74888
As far as I can tell, `recordThreadInfo` only needs to be called once per thread. Once we have thread-local subqueues, we can manage this by calling it in the subqueue constructor.
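A minimal sketch of the idea, not the actual profiler code: `registerThread`, `ThreadLocalSubqueue`, `getSubqueue`, and `runWorkers` are hypothetical names, and `registerThread` only stands in for the lock-protected `recordThreadInfo` call. The registration happens in the subqueue constructor, so each thread takes the lock once no matter how many events it records.

```cpp
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the lock-protected registration in Kineto.
std::mutex g_registry_mutex;
int g_register_calls = 0;

void registerThread() {
  std::lock_guard<std::mutex> guard(g_registry_mutex);
  ++g_register_calls;
}

// Hypothetical per-thread queue. Registration lives in the constructor,
// so it runs once per thread instead of once per recorded event.
struct ThreadLocalSubqueue {
  ThreadLocalSubqueue() { registerThread(); }
  void record(std::int64_t event) { events.push_back(event); }
  std::vector<std::int64_t> events;
};

ThreadLocalSubqueue& getSubqueue() {
  // Constructed lazily on the first call made by each thread.
  thread_local ThreadLocalSubqueue subqueue;
  return subqueue;
}

// Spawns `num_threads` workers that each record `events_per_thread` events
// and returns how many registrations those workers triggered.
int runWorkers(int num_threads, int events_per_thread) {
  int before = 0;
  {
    std::lock_guard<std::mutex> guard(g_registry_mutex);
    before = g_register_calls;
  }
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([events_per_thread] {
      for (int i = 0; i < events_per_thread; ++i) {
        getSubqueue().record(i);  // lock-free after the first call
      }
    });
  }
  for (auto& worker : workers) {
    worker.join();
  }
  std::lock_guard<std::mutex> guard(g_registry_mutex);
  return g_register_calls - before;
}
```

In this sketch, `runWorkers(4, 1000)` records 4000 events but registers only 4 times, once per worker thread.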
Test Plan: The effect on single-threaded overhead is pretty minimal, but it improves stress test overhead from ~6.1 us to ~1.4 us since we're no longer contending over the lock in Kineto.
Reviewed By: chaekit
Differential Revision: D34811694
fbshipit-source-id: da1047f7ae43af048773610a0f250fa514c67989
(cherry picked from commit 9a5b926fb6d28be45fcc492f350b3a5ad5ed6d6f)