Improvements to `stack` logging
Summary:
X-link: https://github.com/pytorch/pytorch/pull/140431
This diff implements multiple improvements to the `stack` field in PT2 Compile Events.
- First, it removes `__start__` from all stacks in PT2 Compile Events, as it's unnecessary.
- Finally, it fixes a bug where we were logging the entire chromium stack instead of just the stack of events that are logged to Scuba. At first, I thought it was safe to log the entire stack, but it leads to icicle views that "lie" about the time taken by intermediate events that are logged to TLParse but not to Scuba.
As an example, the icicle view at https://fburl.com/scuba/pt2_compile_events/11pqmcnt shows that AOTAutogradCache.load is only 50% of backend_compile. But since AOTAutogradCache.load doesn't have its own event in the table, the view only shows the portions of **events whose stacks include AOTAutogradCache.load**, which is incorrect. The actual duration of AOTAutogradCache.load is the entire backend_compile time. {F1958436460}
In reality, the stack in TLParse looks more like this:
{F1958437327}
Here, the extra time is accounted for inside AOTAutogradCache.load and its sub-events, but is unaccounted for by the aggregated stack.
Therefore, the Scuba stack should only contain events that are themselves in the Scuba dataset to begin with. To do that, we keep a `pt2_stack` alongside the regular stack; it is the subset of the regular stack that gets logged to Scuba.
This maintains the ability to register event metadata and the regular stack in TLParse.
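
The two-stack idea can be illustrated with a minimal sketch. This is not the code in this diff: the class `EventStackSketch`, its methods, the `logged_to_scuba` flag, and the event name `child_event` are all hypothetical names chosen for illustration. It only shows how a second stack can hold the subset of open events that are also logged to Scuba.

```python
from typing import List, Optional, Tuple


class EventStackSketch:
    """Hypothetical tracker for illustration; not the actual PyTorch logger."""

    def __init__(self) -> None:
        # Every open event, with a flag saying whether it is also logged to Scuba.
        self._open: List[Tuple[str, bool]] = []
        # Subset of open events that are logged to Scuba.
        self.pt2_stack: List[str] = []

    @property
    def stack(self) -> List[str]:
        # Full stack of open events, as TLParse would see it.
        return [name for name, _ in self._open]

    def begin(self, name: str, logged_to_scuba: bool) -> None:
        self._open.append((name, logged_to_scuba))
        if logged_to_scuba:
            self.pt2_stack.append(name)

    def end(self, name: str) -> Optional[List[str]]:
        """Close the innermost event.

        Returns the stack to attach to the Scuba row, or None if this
        event is not logged to Scuba.
        """
        if not self._open or self._open[-1][0] != name:
            raise ValueError(f"{name!r} is not the innermost open event")
        _, logged = self._open.pop()
        if not logged:
            return None
        scuba_stack = list(self.pt2_stack)
        self.pt2_stack.pop()
        return scuba_stack


tracker = EventStackSketch()
tracker.begin("backend_compile", logged_to_scuba=True)
tracker.begin("AOTAutogradCache.load", logged_to_scuba=False)
tracker.begin("child_event", logged_to_scuba=True)
full_stack = tracker.stack            # includes AOTAutogradCache.load
child_scuba = tracker.end("child_event")
load_scuba = tracker.end("AOTAutogradCache.load")
outer_scuba = tracker.end("backend_compile")
```

In this sketch the Scuba stack for `child_event` skips `AOTAutogradCache.load`, so an aggregated view never shows an intermediate frame that has no row of its own, while the full stack stays available for TLParse.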
Reviewed By: ezyang, masnesral
Differential Revision: D65832045
fbshipit-source-id: 9b4f2f056fd9fcb315958e08438c087b51a30c23