[PyTorch] Replace RecordFunction shouldRun callback with atomic bools (#56504)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56504
Having callbacks registered but disabled via their
`shouldRun` callback defeats the `shouldRunRecordFunction`
optimization (no relation between the two things, despite the
shared prefix on the names) that aims to skip `RecordFunction`
construction.
This diff attempts to safely rectify this issue: we drop support for
`shouldRun` callbacks (this is bc-breaking; does anything use these
externally? do I need to add the support back and just stop using it
internally?), add support for enabling and disabling callbacks, and
(for global callbacks) make doing so thread-safe.
There is an interesting subtlety with `std::atomic` that came up: it
is neither copyable nor movable, which precludes putting it into
`std::vector`. I manually overrode this because the thread safety
reasons it is neither copyable nor movable don't apply here; we
already state that adding or removing callbacks (the operations that
might copy/move an atomic) are not thread-safe and should be done at
initialization time.
ghstack-source-id: 129614296
Test Plan:
Existing CI should cover correctness, right? Inspected
perf report of a simple benchmark that runs nn.Linear in a loop on
CUDA, where internally have Kineto initialized and thus had a
shouldRun observer previously; we are no longer going through the
dispatcher's slow RecordFunction path or spending measurable time
constructing RecordFunction instances.
Reviewed By: ilia-cher
Differential Revision: D27834944
fbshipit-source-id: 93db1bc0a28b5372f7307490c908457e7853fa92