[Profiler] Split observer implementations based on ProfilerState (#71135)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71135
The NVTX profiler is quite different from the other Kineto cases, so it is worth peeling it off early; later logic can then assume the state is either KINETO or KINETO_GPU_FALLBACK. This matters more now that we are going to change the Kineto internals. (You can see that the Python tracer was unnecessarily coupled to NVTX just because the control logic was intermingled.)
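As a rough illustration of the early split described above, here is a minimal, self-contained sketch. It is not the code in this PR: the enum is a simplified stand-in for ProfilerState, and `pushNVTXCallbacksSketch`, `pushKinetoCallbacksSketch`, and `enableProfilerSketch` are hypothetical names invented for the example.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Simplified stand-in for ProfilerState (assumption: only the states
// mentioned in this summary are modeled here).
enum class ProfilerState { Disabled, NVTX, KINETO, KINETO_GPU_FALLBACK };

// Hypothetical NVTX-only observer setup. It shares no logic with Kineto.
std::string pushNVTXCallbacksSketch() {
  return "nvtx";
}

// Hypothetical Kineto observer setup. Because the caller filters out NVTX
// first, this function may assume one of the two Kineto states.
std::string pushKinetoCallbacksSketch(ProfilerState state) {
  assert(state == ProfilerState::KINETO ||
         state == ProfilerState::KINETO_GPU_FALLBACK);
  return state == ProfilerState::KINETO ? "kineto" : "kineto_gpu_fallback";
}

// Dispatch once, at the top, on the profiler state.
std::string enableProfilerSketch(ProfilerState state) {
  if (state == ProfilerState::NVTX) {
    return pushNVTXCallbacksSketch();
  }
  if (state == ProfilerState::KINETO ||
      state == ProfilerState::KINETO_GPU_FALLBACK) {
    return pushKinetoCallbacksSketch(state);
  }
  throw std::invalid_argument("unsupported ProfilerState");
}
```

With this shape, `enableProfilerSketch(ProfilerState::NVTX)` never touches the Kineto path, and the Kineto path never has to branch on NVTX.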
There's also no reason to put the legacy observer state in the header rather than the cpp file now that the kineto profiler doesn't need it, so we should shield it from prying eyes.
The recent headaches with TLS downcasting and RPC integration (D32678163 (https://github.com/pytorch/pytorch/commit/7ea86dfdb162758c9fbbf6807ab1dd778591c062), D33283314 (https://github.com/pytorch/pytorch/commit/681e78bacec69c3ac6653483da2236d0e0416c6e), D33437773 (https://github.com/pytorch/pytorch/commit/7d6535cab39a6277aa0b40cfca3b9c918ef9e095)) have made it crystal clear that we need a lot more safety in the profiler, particularly as we shift things around.
Test Plan: Unit tests. This is no longer a performance PR.
Reviewed By: aaronenyeshi
Differential Revision: D32710829
fbshipit-source-id: f9138598b3cfeba71872905a7afab3c03c0d56e7
(cherry picked from commit 059a39d8e3b184337ddd401cfd242c47b8ad0538)