ThroughputBenchmark: integration with Autograd Profiler (#36282)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36282
The integration is done explicitly in the tool because we don't want profiling to capture warmup or input cloning. The benchmarking code is therefore made aware of the profiler so that it can keep those phases out of the trace.
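The pattern described above can be illustrated with a minimal, stdlib-only sketch. It is not the ThroughputBenchmark implementation: `Recorder`, `run_benchmark`, and their parameters are hypothetical stand-ins, and the recorder only notes which phases ran while it was active. The point is that input cloning and warmup happen before the recording region is entered, so only the measured forward calls show up.

```python
import threading
import time
from contextlib import contextmanager


class Recorder:
    """Hypothetical stand-in for a profiler; NOT the Autograd profiler API."""

    def __init__(self):
        self.active = False
        self.events = []
        self._lock = threading.Lock()

    @contextmanager
    def record(self):
        # Events are kept only while this context is open.
        self.active = True
        try:
            yield self
        finally:
            self.active = False

    def mark(self, name):
        if self.active:
            with self._lock:
                self.events.append(name)


def run_benchmark(model, inputs, recorder, num_threads=4,
                  warmup_iters=2, iters_per_thread=5):
    # Input cloning: one private copy per thread, done outside the recorder.
    per_thread_inputs = []
    for _ in range(num_threads):
        recorder.mark("clone")
        per_thread_inputs.append(list(inputs))

    # Warmup: also outside the recorder, so it never reaches the trace.
    for _ in range(warmup_iters):
        recorder.mark("warmup")
        model(per_thread_inputs[0])

    def worker(thread_inputs):
        for _ in range(iters_per_thread):
            recorder.mark("forward")
            model(thread_inputs)

    threads = [threading.Thread(target=worker, args=(x,))
               for x in per_thread_inputs]

    # Only the measured run is wrapped by the recorder.
    with recorder.record():
        start = time.perf_counter()
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        elapsed = time.perf_counter() - start
    return elapsed


recorder = Recorder()
elapsed = run_benchmark(sum, [1.0, 2.0, 3.0], recorder)
print(sorted(set(recorder.events)), len(recorder.events))
```

If the profiler were instead wrapped around the whole benchmark call from the outside, the clone and warmup phases would be recorded as well, which is what this change is meant to avoid.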
Example output from a benchmark run with the Autograd profiler enabled:
```
I0408 16:06:40.300040 85516 throughput_benchmark-inl.h:106] Using Autograd profiler. Trace will be saved to /tmp/tmpt0gsz85y
I0408 16:06:40.302232 85516 throughput_benchmark-inl.h:111] Starting threads
I0408 16:06:40.302258 85524 throughput_benchmark-inl.h:78] Starting forward thread 1
I0408 16:06:40.302259 85525 throughput_benchmark-inl.h:78] Starting forward thread 2
I0408 16:06:40.302261 85523 throughput_benchmark-inl.h:78] Starting forward thread 0
I0408 16:06:40.302259 85526 throughput_benchmark-inl.h:78] Starting forward thread 3
I0408 16:06:40.412879 85525 throughput_benchmark-inl.h:88] Shutting down forward thread 2. Total number of finished threads: 1
I0408 16:06:40.412971 85523 throughput_benchmark-inl.h:88] Shutting down forward thread 0. Total number of finished threads: 2
I0408 16:06:40.412989 85526 throughput_benchmark-inl.h:88] Shutting down forward thread 3. Total number of finished threads: 3
I0408 16:06:40.413033 85524 throughput_benchmark-inl.h:88] Shutting down forward thread 1. Total number of finished threads: 4
I0408 16:06:40.413056 85516 throughput_benchmark-inl.h:123] Finished benchmark
Average latency per example: 443.256us
Total number of iterations: 1000
Total number of iterations per second (across all threads): 9024.12
Total time: 110.814ms
```
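As a sanity check, the reported figures are mutually consistent. The short calculation below uses only numbers copied from the output above. The interpretation is an assumption inferred from those numbers, not a statement of how the tool computes them: iterations are taken to be spread evenly over the 4 forward threads, with each thread busy for the whole run.

```python
# Figures copied from the example output above.
avg_latency_us = 443.256
total_iterations = 1000
total_time_ms = 110.814
num_threads = 4  # forward threads 0 through 3 in the log

# Throughput across all threads: iterations divided by wall-clock time.
iters_per_second = total_iterations / (total_time_ms / 1000.0)

# Assumption: work is split evenly, so each thread spends
# (average latency * iterations / threads) inside forward calls.
per_thread_busy_ms = avg_latency_us * total_iterations / num_threads / 1000.0

print(f"{iters_per_second:.2f} iterations/s")
print(f"{per_thread_busy_ms:.3f} ms of forward calls per thread")
```

The throughput agrees with the logged value up to rounding of the printed total time, and the per-thread busy time matches the logged total time.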
Test Plan: Imported from OSS
Differential Revision: D20987125
Pulled By: ezyang
fbshipit-source-id: 1f8980c3a5a0abdc268c7a16c99aa9ea868689eb