Add lazy_bench.py to measure trace overhead and compute efficiency (#68563)
This tool adds two new 'experiments' and builds on infra from @jansel's torchdynamo benchmark script (torchdynamo/torchbench.py).
The infra components take care of:
- iterating over torchbench models in a more convenient way, with filtering and error handling
- correctness checks
- mixing in non-torchbenchmark benchmarks
- interleaving measurements of the control/experiment and computing statistical significance (see the sketch after this list)
- hooks for synchronization that can be specialized for cuda or lazytensor
- custom sync modes that allow syncing after every step or running many async steps before syncing
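
A minimal sketch of the interleaving/significance idea (names like `timed` and `interleaved_compare` are illustrative, not the actual API in lazy_bench.py):

```python
import time

from scipy.stats import ttest_rel

def timed(fn, sync):
    """Time one call to fn; sync() is the specialization point
    (e.g. torch.cuda.synchronize for cuda, a mark_step + wait for lazy)."""
    t0 = time.perf_counter()
    fn()
    sync()
    return time.perf_counter() - t0

def interleaved_compare(control, experiment, sync, rounds=50):
    # Alternate control/experiment each round so drift (clock speed,
    # thermals, background load) affects both sides roughly equally.
    ctrl, expt = [], []
    for _ in range(rounds):
        ctrl.append(timed(control, sync))
        expt.append(timed(experiment, sync))
    # Paired t-test over per-round samples: is the difference significant?
    _, pvalue = ttest_rel(ctrl, expt)
    return sum(expt) / sum(ctrl), pvalue
```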
The overhead experiment compares lazy trace overhead against full cuda execution time. This may sound like a strange choice, but the point is to provide a reference for how much time lazy tracing takes relative to the time eager mode normally spends in execution. The expectation is that tracing is a small fraction of that time.
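
Roughly, the measurement looks like the sketch below (assuming a LazyTensor-enabled build with a "lazy" device and torch._lazy.mark_step(); illustrative only, not the script's actual code):

```python
import time

import torch
import torch._lazy

def overhead_ratio(model, example, steps=20):
    # Lazy side: time trace construction; mark_step() cuts the trace and
    # dispatches execution asynchronously, so the loop is dominated by
    # tracing cost rather than kernel time.
    lazy_model = model.to("lazy")
    lazy_input = example.to("lazy")
    t0 = time.perf_counter()
    for _ in range(steps):
        lazy_model(lazy_input)
        torch._lazy.mark_step()
    trace_time = time.perf_counter() - t0

    # Eager side: full cuda execution, synchronized so the timer captures
    # real kernel time rather than just launch overhead.
    cuda_model = model.to("cuda")
    cuda_input = example.to("cuda")
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(steps):
        cuda_model(cuda_input)
    torch.cuda.synchronize()
    cuda_time = time.perf_counter() - t0

    # Expected to be a small fraction.
    return trace_time / cuda_time
```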
An alternative approach is to measure lazy tracing against the portion of eager execution that launches cuda kernels but never syncs. If we could guarantee that the cuda driver never forced syncs on its own, this would be a fair comparison; it also wouldn't work for CPU. At least the full-sync approach is consistent.

Another alternative is to compare lazy trace overhead to execution with meta tensors. We don't expect meta tensors to work for this case, since we know that many of the ops we lazy-trace are not yet structured kernels.
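
For context, a tiny illustration of the meta-tensor idea: meta tensors carry only shapes/dtypes, so "executing" on them approximates dispatch overhead with no real compute. Ops without meta support (including many not-yet-structured kernels) simply fail, which is why this comparison isn't expected to work across torchbench yet.

```python
import torch

# Meta tensors hold no data: ops run shape/dtype inference only.
x = torch.randn(128, 128, device="meta")
y = x @ x          # infers the output shape, launches no compute
print(y.shape)     # torch.Size([128, 128])
```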