Quality of life improvements to Timer (#53294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53294
Just a bunch of little things, none of which is big enough to need its own PR.
1) C++ wall time should release the GIL
2) Add an option to retain the `callgrind.out` contents. This allows processing with KCachegrind for more detailed analysis.
3) Stop subtracting the baseline instruction counts. (People just found it confusing when they saw negative instruction counts.) There is a finesse in #53295 that drops the baseline to ~800 instructions for `number=100`, and at that level it's not worth correcting.
4) Add a `__mul__` overload to function counts. For example, if `c0` was run with `number=100` and `c1` was run with `number=200`, then `c0 * 2 - c1` is needed to properly diff them. (Obviously there are correctness concerns, but I think it's fine as a caveat emptor convenience method.)
5) Tweak the `callgrind_annotate` call, since by default it filters very small counts.
6) Make some args keyword-only, since their types could be ambiguous otherwise.
7) Don't omit rows from slices. It was annoying to print something like `stats[:25]` and have `__repr__` hide the lines in the middle.
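A rough sketch of the arithmetic in item 4, for illustration only. `ToyCounts`, the function names, and the numbers are all hypothetical stand-ins; this is not the actual function counts implementation, just the idea of scaling one run to match another's `number` before diffing.

```python
class ToyCounts:
    """Hypothetical stand-in for per-function instruction counts."""

    def __init__(self, counts):
        self.counts = dict(counts)

    def __mul__(self, factor):
        # Scale every count, e.g. to normalize runs that used a different `number`.
        return ToyCounts({fn: int(n * factor) for fn, n in self.counts.items()})

    def __sub__(self, other):
        # Per-function difference; entries that cancel exactly are dropped.
        keys = set(self.counts) | set(other.counts)
        diff = {k: self.counts.get(k, 0) - other.counts.get(k, 0) for k in keys}
        return ToyCounts({k: v for k, v in diff.items() if v != 0})


# Made-up counts: pretend c0 was collected with number=100, c1 with number=200.
c0 = ToyCounts({"aten::add": 5000, "malloc": 300})
c1 = ToyCounts({"aten::add": 10000, "malloc": 750})

# Scale c0 up to the same number of iterations as c1, then diff.
delta = c0 * 2 - c1
print(delta.counts)
```

Scaling assumes counts grow linearly with `number`, which is the correctness concern noted in item 4.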
Test Plan: Imported from OSS
Reviewed By: Chillee
Differential Revision: D26906715
Pulled By: robieta
fbshipit-source-id: 53d5cd92cd17212ec013f89d48ac8678ba6e6228