Add repeats to Timer.collect_callgrind(...) (#53295)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53295
A lot of the time spent in `collect_callgrind` goes to spinning up Valgrind and executing the initial `import torch`; in most cases the actual run loop is a much smaller fraction. We can therefore reuse the same process for multiple replicates and amortize that startup cost much more effectively. This also tends to yield more stable measurements: the kth run is more repeatable than the first because everything has had a chance to settle into a steady state. The instruction microbenchmarks lean heavily on this behavior. In practice I found several `n=100` replicates to be more reliable than one monolithic 10,000+ iteration run, since rare events such as memory consolidation contaminate only a single replicate instead of getting mixed into the entire long run.
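For reference, a minimal sketch of how the new argument might be used. The exact return type when `repeats` is passed (assumed here to be one `CallgrindStats` per replicate) is an assumption; see the diff for the authoritative signature.

```python
# Sketch only: assumes collect_callgrind yields one CallgrindStats per
# replicate when `repeats` is given; exact signature may differ.
from torch.utils.benchmark import Timer

timer = Timer(
    stmt="x + y",
    setup="import torch; x = torch.ones((4, 4)); y = torch.ones((4, 4))",
)

# Valgrind starts and `import torch` runs once; the run loop then executes
# `repeats` times inside that same process, amortizing the startup cost.
stats = timer.collect_callgrind(number=100, repeats=10)

for replicate in stats:
    print(replicate.counts(denoise=True))
```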
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D26907093
Pulled By: robieta
fbshipit-source-id: 72e5b48896911f5dbde96c8387845d7f9882fdb2