[PyTorch] Lazily construct guts of RecordFunction (#47550)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47550
I saw over 5% of time spent in RecordFunction's constructor in `perf`
during one of our framework overhead benchmarks. Inspecting the
assembly, it looks like we simply create a lot of RecordFunctions, and
the constructor has to initialize a relatively large number of member
variables.
This diff takes advantage of the observation that RecordFunction does
nothing most of the time: it moves RecordFunction's state onto the heap
and allocates it only if needed. This does add the requirement that
profiling is actually active in order to use RecordFunction accessors,
which I hope won't be a problem.
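For readers unfamiliar with the pattern, here is a minimal sketch of lazily
allocated state, not the actual code in this diff. The names
`LazyRecordFunction`, `State`, `profilingEnabled`, and the member fields are
hypothetical stand-ins; the real RecordFunction has different members and a
different check for whether profiling is active.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical global switch standing in for "is any profiler active?".
inline bool& profilingEnabled() {
  static bool enabled = false;
  return enabled;
}

class LazyRecordFunction {
 public:
  LazyRecordFunction() {
    // Fast path: with profiling off, the constructor only initializes one
    // null pointer instead of every member of State.
    if (profilingEnabled()) {
      state_ = std::make_unique<State>();
    }
  }

  // True only if the heap state was allocated at construction time.
  bool isActive() const { return state_ != nullptr; }

  // Accessors are only valid while active; this is the new requirement
  // described above.
  const std::string& name() const {
    assert(isActive() && "name() called on inactive LazyRecordFunction");
    return state_->name;
  }

  void setName(std::string name) {
    assert(isActive() && "setName() called on inactive LazyRecordFunction");
    state_->name = std::move(name);
  }

 private:
  // All the "guts" live here, so they cost nothing unless allocated.
  struct State {
    std::string name;
    std::vector<int64_t> inputs;
    int64_t sequenceNumber = -1;
    uint64_t threadId = 0;
  };

  std::unique_ptr<State> state_;
};
```

The trade-off in this sketch is one heap allocation when profiling is on,
in exchange for a near-free constructor in the common case where it is off.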
ghstack-source-id: 117498489
Test Plan: Run framework overhead benchmarks. Savings range from 3% (InPlace_ndim_1) to 7.5% (empty_ndim_3) of wall time.
Reviewed By: ilia-cher
Differential Revision: D24812213
fbshipit-source-id: 823a1e2ca573d9a8d7c5b7bb3972987faaacd11a