Add callgrind collection to Timer (#44717)
Summary:
This PR allows Timer to collect deterministic instruction counts for (some) snippets. Because of the intrusive nature of Valgrind (effectively replacing the CPU with an emulated one) we have to perform our measurements in a separate process. This PR writes a `.py` file containing the Timer's `setup` and `stmt`, and executes it within a `valgrind` subprocess along with a plethora of checks and error handling. There is still a bit of jitter around the edges due to the Python glue that I'm using, but the PyTorch signal is quite good and thus this provides a low friction way of getting signal. I considered using JIT as an alternative, but:
A) Python specific overheads (e.g. parsing) are important
B) JIT might do rewrites which would complicate measurement.
Consider the following bit of code, related to https://github.com/pytorch/pytorch/issues/44484:
```
from torch.utils._benchmark import Timer
counts = Timer(
"x.backward()",
setup="x = torch.ones((1,)) + torch.ones((1,), requires_grad=True)"
).collect_callgrind()
for c, fn in counts[:20]:
print(f"{c:>12} {fn}")
```
```
812800 ???:_dl_update_slotinfo
355600 ???:update_get_addr
308300 work/Python/ceval.c:_PyEval_EvalFrameDefault'2
304800 ???:__tls_get_addr
196059 ???:_int_free
152400 ???:__tls_get_addr_slow
138400 build/../c10/core/ScalarType.h:c10::typeMetaToScalarType(caffe2::TypeMeta)
126526 work/Objects/dictobject.c:_PyDict_LoadGlobal
114268 ???:malloc
101400 work/Objects/unicodeobject.c:PyUnicode_FromFormatV
85900 work/Python/ceval.c:_PyEval_EvalFrameDefault
79946 work/Objects/typeobject.c:_PyType_Lookup
72000 build/../c10/core/Device.h:c10::Device::validate()
70000 /usr/include/c++/8/bits/stl_vector.h:std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector()
66400 work/Objects/object.c:_PyObject_GenericGetAttrWithDict
63000 ???:pthread_mutex_lock
61200 work/Objects/dictobject.c:PyDict_GetItem
59800 ???:free
58400 work/Objects/tupleobject.c:tupledealloc
56707 work/Objects/dictobject.c:lookdict_unicode_nodummy
```
Moreover, if we backport this PR to 1.6 (just copy the `_benchmarks` folder) and load those counts as `counts_1_6`, then we can easily diff them:
```
print(f"Head instructions: {sum(c for c, _ in counts)}")
print(f"1.6 instructions: {sum(c for c, _ in counts_1_6)}")
count_dict = {fn: c for c, fn in counts}
for c, fn in counts_1_6:
_ = count_dict.setdefault(fn, 0)
count_dict[fn] -= c
count_diffs = sorted([(c, fn) for fn, c in count_dict.items()], reverse=True)
for c, fn in count_diffs[:15] + [["", "..."]] + count_diffs[-15:]:
print(f"{c:>8} {fn}")
```
```
Head instructions: 7609547
1.6 instructions: 6059648
169600 ???:_dl_update_slotinfo
101400 work/Objects/unicodeobject.c:PyUnicode_FromFormatV
74200 ???:update_get_addr
63600 ???:__tls_get_addr
46800 work/Python/ceval.c:_PyEval_EvalFrameDefault
33512 work/Objects/dictobject.c:_PyDict_LoadGlobal
31800 ???:__tls_get_addr_slow
31700 build/../aten/src/ATen/record_function.cpp:at::RecordFunction::RecordFunction(at::RecordScope)
28300 build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionSignature::parse(_object*, _object*, _object*, _object**, bool)
27800 work/Objects/object.c:_PyObject_GenericGetAttrWithDict
27401 work/Objects/dictobject.c:lookdict_unicode_nodummy
24115 work/Objects/typeobject.c:_PyType_Lookup
24080 ???:_int_free
21700 work/Objects/dictobject.c:PyDict_GetItemWithError
20700 work/Objects/dictobject.c:PyDict_GetItem
...
-3200 build/../c10/util/SmallVector.h:at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&, bool)
-3400 build/../aten/src/ATen/native/TensorIterator.cpp:at::TensorIterator::resize_outputs(at::TensorIteratorConfig const&)
-3500 /usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:std::unique_lock<std::mutex>::unlock()
-3700 build/../torch/csrc/utils/python_arg_parser.cpp:torch::PythonArgParser::raw_parse(_object*, _object*, _object**)
-4207 work/Objects/obmalloc.c:PyMem_Calloc
-4500 /usr/include/c++/8/bits/stl_vector.h:std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector()
-4800 build/../torch/csrc/autograd/generated/VariableType_2.cpp:torch::autograd::VariableType::add__Tensor(at::Tensor&, at::Tensor const&, c10::Scalar)
-5000 build/../c10/core/impl/LocalDispatchKeySet.cpp:c10::impl::ExcludeDispatchKeyGuard::ExcludeDispatchKeyGuard(c10::DispatchKey)
-5300 work/Objects/listobject.c:PyList_New
-5400 build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionParameter::check(_object*, std::vector<pybind11::handle, std::allocator<pybind11::handle> >&)
-5600 /usr/include/c++/8/bits/std_mutex.h:std::unique_lock<std::mutex>::unlock()
-6231 work/Objects/obmalloc.c:PyMem_Free
-6300 work/Objects/listobject.c:list_repeat
-11200 work/Objects/listobject.c:list_dealloc
-28900 build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionSignature::parse(_object*, _object*, _object**, bool)
```
Remaining TODOs:
* Include a timer in the generated script for cuda sync.
* Add valgrind to CircleCI machines and add a unit test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44717
Reviewed By: soumith
Differential Revision: D24010742
Pulled By: robieta
fbshipit-source-id: df6bc765f8efce7193893edba186cd62b4b23623