Put "everything" WaitCounters in dynamo_timed (#151757)
Summary:
The main motivation is to capture the cudagraphs overhead in a WaitCounter. We'll combine that with Triton autotuning, and therefore rename to "compile_runtime_overheads". Since we have a couple WaitCounters where we want to capture all runtime and compile overheads, let's put the accounting in dynamo_timed so we'll automatically capture any toplevel timed regions that get added in the future. Also, dynamo_timed already has to figure out if we're timing a runtime vs. compile-time event, so we can reuse some of that logic.
X-link: https://github.com/pytorch/pytorch/pull/151757
Approved by: https://github.com/ppanchalia
ghstack dependencies: #151749
Reviewed By: wdvr
Differential Revision: D73440149
fbshipit-source-id: 1b9074bef52b902da09001b4c006661c7d537477