[Profiler] Pop `KinetoThreadLocalState` at the start of post processing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77996
An issue that recently surfaced internally highlighted that, because `KinetoThreadLocalState` is only removed from the TLS at the end of post processing, we are still profiling memory during post processing (which violates a whole bunch of invariants in the system). This change switches the global profiling ctx to a `shared_ptr`, introduces a class to manage it (`init`, `get`, and `pop` methods), and moves the `pop` call to the beginning of `disableProfiler`.
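To illustrate the ordering, here is a minimal sketch of the pattern and not the code in this PR. `ProfilerState`, `ProfilerStateManager`, `reportMemoryUsage`, and `disableProfilerSketch` are hypothetical names, and the mutex-guarded static slot is an assumption about how such a manager could be written. It shows that once the state is popped first, memory events raised during post processing find no registered state and are not recorded.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for KinetoThreadLocalState: it only records the
// sizes reported by the memory hook below.
struct ProfilerState {
  std::vector<std::size_t> memory_events;
};

// Hypothetical manager for the global profiling context (assumed design).
class ProfilerStateManager {
 public:
  // Register a state as the active profiling context.
  static void init(std::shared_ptr<ProfilerState> state) {
    std::lock_guard<std::mutex> guard(stateMutex());
    slot() = std::move(state);
  }

  // Return the active context, or nullptr if none is registered.
  static std::shared_ptr<ProfilerState> get() {
    std::lock_guard<std::mutex> guard(stateMutex());
    return slot();
  }

  // Unregister the active context and hand ownership to the caller.
  static std::shared_ptr<ProfilerState> pop() {
    std::lock_guard<std::mutex> guard(stateMutex());
    return std::exchange(slot(), nullptr);
  }

 private:
  static std::shared_ptr<ProfilerState>& slot() {
    static std::shared_ptr<ProfilerState> state;
    return state;
  }
  static std::mutex& stateMutex() {
    static std::mutex m;
    return m;
  }
};

// Hypothetical memory hook: records only while a context is registered.
void reportMemoryUsage(std::size_t bytes) {
  if (auto state = ProfilerStateManager::get()) {
    state->memory_events.push_back(bytes);
  }
}

// Sketch of the new ordering: pop first, then post process.
// Returns the number of memory events the popped state recorded.
std::size_t disableProfilerSketch() {
  auto state = ProfilerStateManager::pop();
  if (!state) {
    return 0;  // Profiler was not enabled.
  }
  // Simulated post processing that allocates. The hook finds no
  // registered state, so this allocation is not recorded.
  std::vector<int> scratch(1024);
  reportMemoryUsage(scratch.size() * sizeof(int));
  return state->memory_events.size();
}
```

Because `pop` returns a `shared_ptr`, the state stays alive for the rest of post processing even though it is no longer reachable through `get`.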
Differential Revision: [D36555738](https://our.internmc.facebook.com/intern/diff/D36555738/)
Approved by: https://github.com/aaronenyeshi