[Profiler] Pay for what you use (v2) (#74484)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74484
In my first attempt at this in December I stamped out specializations using variadic templates. However I'm able to get comparable performance using simple conditionals since the branch is very predictable and AppendOnlyList::emplace_back is low enough overhead that multiple calls don't cause an issue.
This is also a chance to do some BE: rather than force ops and backend events to use the same fields (which in practice means setting a bunch of default values when reporting backend events), I just split them and use a variant.
Test Plan: The single threaded benchmark (with no extra options set) improved considerably from ~0.88 us to ~0.62 us. The stress test benchmark improved modestly from ~6.1 us to ~5.8 us. So the bottleneck for multi-threading is somewhere else, but doing less wasted work is still able to move the needle a little bit.
Reviewed By: swolchok
Differential Revision: D34779994
fbshipit-source-id: 392bc7c6f12797fa5e18777063aa21210d9d2067
(cherry picked from commit f0a49ff7be8aa65bab2f6952cc2e6306c1edc24b)