pytorch
614b8657 - [profiler] _RecordFunctionFast - faster python bindings for record_function (#107195)

Commit

1 year ago

[profiler] _RecordFunctionFast - faster python bindings for record_function (#107195)

torch.profiler.record_function is relatively slow; for example, in some benchmarks I was running, x.view_as(x) was ~2us, and ~16-17us when wrapped in a record_function context. The reasons for this are: dispatcher overhead from going through an op (the main source of overhead), python binding / python conversion overhead, and some overhead from the context manager.

This new implementation is faster, but it won't work with torchscript. Based on the benchmarks I was running, it adds 0.5-0.7us overhead per call when the profiler is turned off.

To use it, you can just:

```python
with torch._C._profiler_manual._RecordFunctionFast("title"):
    torch.add(x, y)
```

It implements a context manager in python which directly calls the record_function utilities, instead of calling through an op.

* The context manager is implemented directly in python because the overhead from calling a python function seems non-negligible
* All the record_function calls, python object conversions are guarded on checks for whether the profiler is enabled or not. It seems like this saves a few hundred nanoseconds.

For more details about the experiments I ran to choose this implementation, see [my record_functions experiments branch](https://github.com/pytorch/pytorch/compare/main...davidberard98:pytorch:record-function-fast-experiments?expand=1).

This also adds a `torch.autograd.profiler._is_profiler_enabled` global variable that can be used to check whether a profiler is currently enabled. It's useful for further reducing the overhead, like this:

```python
if torch.autograd.profiler._is_profiler_enabled:
    with torch._C._profiler_manual._RecordFunctionFast("title"):
        torch.add(x, y)
else:
    torch.add(x, y)
```

On BERT_pytorch (CPU-bound model), if we add a record_function inside CachedAutotuning.run:

* Naive torch.profiler.record_function() is a ~30% slowdown
* Always wrapping with RecordFunctionFast causes a regression of ~2-4%.
* Guarding with an if statement - any regression is within noise

**Selected benchmark results**: these come from a 2.20GHz machine, GPU build but only running CPU ops; running `x.view_as(x)`, with various record_functions applied (with profiling turned off). For more detailed results see "record_functions experiments branch" linked above (those results are on a different machine, but show the same patterns). Note that the results are somewhat noisy, assume 0.05-0.1us variations

```
Baseline:: 1.7825262546539307 us  # Just running x.view_as(x)
profiled_basic:: 13.600390434265137 us  # torch.profiler.record_function(x) + view_as
precompute_manual_cm_rf:: 2.317216396331787 us  # torch._C._profiler_manual._RecordFunctionFast(), if the context is pre-constructed + view_as
guard_manual_cm_rf:: 1.7994389533996582 us  # guard with _is_profiler_enabled + view_as
```

Differential Revision: [D48421198](https://our.internmc.facebook.com/intern/diff/D48421198)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107195

Approved by: https://github.com/albanD, https://github.com/aaronenyeshi
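The two ideas in the commit message (a context manager whose bookkeeping is skipped when profiling is off, and an outer `if` guard that avoids entering the context manager at all) can be illustrated without PyTorch. The sketch below uses only the standard library. `RecordScope`, its `enabled` flag, and `work` are hypothetical stand-ins for `_RecordFunctionFast`, `_is_profiler_enabled`, and the wrapped op; they are not PyTorch APIs, and the timings it prints say nothing about the real bindings.

```python
import timeit


class RecordScope:
    """Hypothetical stand-in for a fast record_function context manager."""

    enabled = False  # stand-in for a global "is a profiler active?" flag
    events = []      # recorded (phase, name) pairs, for illustration only

    def __init__(self, name):
        self.name = name

    def __enter__(self):
        # Bookkeeping is skipped entirely when profiling is off.
        if RecordScope.enabled:
            RecordScope.events.append(("begin", self.name))
        return self

    def __exit__(self, exc_type, exc, tb):
        if RecordScope.enabled:
            RecordScope.events.append(("end", self.name))
        return False  # never swallow exceptions


def work():
    # Placeholder for the cheap operation being wrapped.
    return sum((1, 2, 3))


def always_wrapped():
    # Pays context-manager construction and enter/exit on every call.
    with RecordScope("title"):
        return work()


def guarded():
    # Pays only one flag check when profiling is off.
    if RecordScope.enabled:
        with RecordScope("title"):
            return work()
    return work()


def per_call_us(fn, number=100_000):
    # Average wall-clock time per call, in microseconds.
    return timeit.timeit(fn, number=number) / number * 1e6


if __name__ == "__main__":
    for fn in (work, always_wrapped, guarded):
        print(f"{fn.__name__}: {per_call_us(fn):.3f} us")
```

With the flag off, `guarded` is expected to track `work` closely, while `always_wrapped` still pays for constructing and entering the context manager. This mirrors the shape, though not the magnitudes, of the benchmark rows above.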

Author

davidberard98

Committer

pytorchmergebot

Parents

137d96a2
