[RecordFunction] Don't lazily construct the guts of RecordFunction. (#76016)
Summary:
When we were pre-sampling, this was a pretty important optimization. Now, however, whenever we construct a RecordFunction we can be sure that it will be called.
For the RECORD_FUNCTION macros I preserved the old behavior by making a `c10::optional<RecordFunction>` since we can't force callers to have separate paths the way Dispatcher does.
Maybe it makes sense to have a guard that handles the optional logic? If we can move enough out of the internals (e.g. replace `std::string`s with `char*`s) we might not even need the optional to get good perf.
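For illustration only, here is a minimal sketch of the optional pattern described above. It is not the actual macro expansion: `FakeRecordFunction`, `g_has_observers`, and `run_op` are hypothetical stand-ins, and `std::optional` is used in place of `c10::optional` so the snippet compiles on its own (C++17).

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for RecordFunction: construction is eager, so every
// constructed object pays for its internals (here, a std::string).
struct FakeRecordFunction {
  static inline int constructed = 0;  // counts how many guards were built
  std::string name;
  explicit FakeRecordFunction(std::string n) : name(std::move(n)) {
    ++constructed;
  }
};

// Hypothetical stand-in for "is any observer active right now?".
inline bool g_has_observers = false;

// Macro-style call site: the caller has a single code path, so the guard
// lives in an optional that stays empty unless an observer is active.
int run_op(int x) {
  std::optional<FakeRecordFunction> guard;
  if (g_has_observers) {
    guard.emplace("run_op");  // eager construction, only when it will be used
  }
  assert(guard.has_value() == g_has_observers);
  return x * 2;  // the "real" work of the op
}
```

With no observers active, `run_op` never constructs the guard, which is the behavior the optional is meant to preserve for macro callers.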
Test Plan: The no-op observer overhead benchmark got a bit better, but even with lots of replicates it's hard to tell if that's just noise. This is primarily a change to simplify the semantics of RecordFunction.
Reviewed By: chaekit
Differential Revision: D35276157
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76016
Approved by: https://github.com/chaekit