[Caffe2] Implement BlackBoxPredictor::BenchmarkIndividualOps (#52903)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52903
Implement BlackBoxPredictor::BenchmarkIndividualOps so that output tensors are cleaned up properly after each iteration, which gives more accurate per-operator timing.
Add four more metrics: setup_time, memory_alloc_time, memory_dealloc_time, and output_dealloc_time.
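For illustration only, the idea of timing each phase separately and freeing outputs after every iteration can be sketched as below. This is not the code from this change and does not use the Caffe2 API: `PhaseTimesMs`, `ElapsedMs`, and `BenchmarkPhases` are hypothetical names, the "operator" is a stand-in loop, and only three phases are modeled (memory_alloc_time and memory_dealloc_time are left out).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical accumulator. Field names echo the metrics in this commit
// message, but the struct is not part of Caffe2.
struct PhaseTimesMs {
  double setup_time = 0.0;
  double run_time = 0.0;
  double output_dealloc_time = 0.0;
};

using Clock = std::chrono::steady_clock;

// Milliseconds elapsed since `start`.
inline double ElapsedMs(Clock::time_point start) {
  return std::chrono::duration<double, std::milli>(Clock::now() - start)
      .count();
}

// Runs `iterations` rounds and returns the average time spent per phase.
// The "operator" is a stand-in that doubles each input element.
PhaseTimesMs BenchmarkPhases(int iterations, std::size_t output_size) {
  assert(iterations > 0);
  PhaseTimesMs total;
  for (int i = 0; i < iterations; ++i) {
    // Setup phase: prepare the inputs for this iteration.
    auto t = Clock::now();
    std::vector<float> input(output_size, 1.0f);
    total.setup_time += ElapsedMs(t);

    // Run phase: allocate the output and execute the stand-in operator.
    t = Clock::now();
    auto output = std::make_unique<std::vector<float>>(output_size);
    for (std::size_t j = 0; j < output_size; ++j) {
      (*output)[j] = input[j] * 2.0f;
    }
    total.run_time += ElapsedMs(t);

    // Output cleanup phase: free the output so that the next iteration
    // starts from a clean state, and time the release on its own.
    t = Clock::now();
    output.reset();
    total.output_dealloc_time += ElapsedMs(t);
  }
  total.setup_time /= iterations;
  total.run_time /= iterations;
  total.output_dealloc_time /= iterations;
  return total;
}
```

Timing the cleanup as its own phase keeps deallocation cost out of the run measurement, so it is not attributed to whichever operator happens to run next.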
Reviewed By: ajyu
Differential Revision: D26657473
fbshipit-source-id: 1cf282192b531513b9ee40b37252087818412f81