[rfc] Reduce number of coin flips in RecordFunction (#40758)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40758
Currently we flip a coin for each sampled callback each time
we run RecordFunction, this PR is an attempt to skip most of the coin
flips (for the low-probability observers) and keep the distribution
close to the original one
Test Plan:
CI and record_function_benchmark
```
(python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 30108 us.
Time per iteration (1x1): 1496.78 us.
Time per iteration (16x16): 2142.46 us.
Pure RecordFunction runtime of 10000000 iterations 687929 us, number of callback invocations: 978
(python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 19051 us.
Time per iteration (1x1): 1581.89 us.
Time per iteration (16x16): 2195.67 us.
Pure RecordFunction runtime of 10000000 iterations 682402 us, number of callback invocations: 1023
(python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 18715 us.
Time per iteration (1x1): 1566.11 us.
Time per iteration (16x16): 2131.17 us.
Pure RecordFunction runtime of 10000000 iterations 693571 us, number of callback invocations: 963
(python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$
(python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 18814 us.
Time per iteration (1x1): 1536.2 us.
Time per iteration (16x16): 1985.82 us.
Pure RecordFunction runtime of 10000000 iterations 944959 us, number of callback invocations: 1015
(python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 18278 us.
Time per iteration (1x1): 1526.32 us.
Time per iteration (16x16): 2093.77 us.
Pure RecordFunction runtime of 10000000 iterations 985307 us, number of callback invocations: 1013
(python_venv) iliacher@devgpu151:~/local/pytorch (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 18545 us.
Time per iteration (1x1): 1524.65 us.
Time per iteration (16x16): 2080 us.
Pure RecordFunction runtime of 10000000 iterations 952835 us, number of callback invocations: 1048
```
Reviewed By: dzhulgakov
Differential Revision: D22320879
Pulled By: ilia-cher
fbshipit-source-id: 2193f07d2f7625814fe7bc3cc85ba4092fe036bc