[Profiler] Switch to thread local subqueues to reduce lock contention. (#74151)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74151
This is the first of several changes that move the profiler onto an optimized recording data structure. This PR keeps the existing monolithic `OpEventData` struct, but splits storage into thread-local subqueues so that inserting an event no longer requires taking a lock.
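
For illustration only, here is a minimal sketch of the thread-local subqueue idea. It is not the code in this PR: `Event`, `Recorder`, `SubQueue`, and `runDemo` are hypothetical names, and the sketch assumes that events are only collected after all recording threads have finished.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a monolithic event struct.
struct Event {
  std::string name;
  std::int64_t start_ns;
};

class Recorder {
 public:
  // Per-thread storage. Only the owning thread appends to it.
  struct SubQueue {
    std::vector<Event> events;
  };

  // Hot path: lock-free once the calling thread has registered its subqueue.
  void record(Event e) {
    getSubQueue().events.push_back(std::move(e));
  }

  // Assumption: called only after all recording threads have finished
  // (for example, after they have been joined).
  std::vector<Event> collect() {
    std::lock_guard<std::mutex> guard(mutex_);
    std::vector<Event> out;
    for (auto& queue : sub_queues_) {
      for (auto& e : queue->events) {
        out.push_back(std::move(e));
      }
      queue->events.clear();
    }
    return out;
  }

 private:
  static std::uint64_t nextId() {
    static std::atomic<std::uint64_t> counter{1};
    return counter++;
  }

  SubQueue& getSubQueue() {
    // Cache the subqueue pointer per thread, tagged with the recorder id so
    // a stale pointer from another Recorder instance is never reused.
    thread_local std::uint64_t cached_id = 0;
    thread_local SubQueue* cached = nullptr;
    if (cached_id != id_) {
      // Slow path: taken when a thread first records into this recorder.
      std::lock_guard<std::mutex> guard(mutex_);
      sub_queues_.push_back(std::make_unique<SubQueue>());
      cached = sub_queues_.back().get();
      cached_id = id_;
    }
    return *cached;
  }

  const std::uint64_t id_ = nextId();
  std::mutex mutex_;  // Guards sub_queues_ only, not the events themselves.
  std::vector<std::unique_ptr<SubQueue>> sub_queues_;
};

// Records `per_thread` events from each of `num_threads` threads and
// returns the total number of events collected.
std::size_t runDemo(int num_threads, int per_thread) {
  Recorder recorder;
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&recorder, per_thread] {
      for (int i = 0; i < per_thread; ++i) {
        recorder.record(Event{"op", i});
      }
    });
  }
  for (auto& w : workers) {
    w.join();
  }
  return recorder.collect().size();
}
```

In this sketch the mutex is taken once per thread to register a subqueue and again at collection time, so the per-event insert path never contends with other threads.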
Test Plan: Unit tests and benchmarks. The single-threaded benchmark is unchanged, and the multithreaded stress test dropped from ~21 us to ~6 us.
Reviewed By: chaekit
Differential Revision: D34720171
fbshipit-source-id: 90b5ebe618b91099e0a19c1f31cfcd8fe1c2ea12
(cherry picked from commit dfed7901ee329224f8fe0b42ef4981e396d918be)