Run performance tests non-alternately (#131935)
Summary:
By default, performance tests (speedup experiments) run the baseline and the test backend alternately.
However, this does not work for the torchao backend, which quantizes the model in place: once the model has been quantized, the "baseline" iterations effectively run with the torchao backend as well.
Add a new experiment, "latency_experiment", that runs performance tests non-alternately: first run the baseline for a few iterations, then run the test backend.
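A minimal sketch of the non-alternating flow (the helper names `measure_latency`, `latency_experiment`, and `apply_backend` are illustrative, not the benchmark suite's actual API):

```python
import time
import torch

def measure_latency(fn, inputs, iters=10):
    # Time a few iterations and return the median latency in seconds.
    times = []
    for _ in range(iters):
        torch.cuda.synchronize()
        start = time.perf_counter()
        fn(*inputs)
        torch.cuda.synchronize()
        times.append(time.perf_counter() - start)
    return sorted(times)[len(times) // 2]

def latency_experiment(model, example_inputs, apply_backend, iters=10):
    # 1) Finish all baseline iterations on the unmodified model first.
    baseline = measure_latency(model, example_inputs, iters)
    # 2) Only then apply the test backend; in-place changes (e.g. torchao
    #    quantization) can no longer contaminate the baseline numbers.
    test_model = apply_backend(model)
    test = measure_latency(test_model, example_inputs, iters)
    return baseline / test  # speedup
```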
Other changes:
Add torch.compiler.cudagraph_mark_step_begin() to avoid the slowdown from "Unable to hit fast path of CUDAGraphs because of pending, uninvoked backwards" (see the sketch below).
Update the torchao API calls to their current versions.
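A rough sketch of where the call fits in a benchmark loop (the model and loop here are illustrative, not the actual harness code):

```python
import torch

model = torch.compile(torch.nn.Linear(1024, 1024).cuda(), mode="reduce-overhead")
x = torch.randn(64, 1024, device="cuda")

for _ in range(100):
    # Mark the start of a new iteration so CUDAGraphs does not wait on a
    # backward pass that the benchmark never invokes; without this, each
    # call falls off the CUDAGraphs fast path and slows down.
    torch.compiler.cudagraph_mark_step_begin()
    out = model(x)
```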
X-link: https://github.com/pytorch/benchmark/pull/2394
Originally Reviewed By: xuzhao9
X-link: https://github.com/pytorch/pytorch/pull/131935
Approved by: https://github.com/xuzhao9
Reviewed By: xuzhao9, PaliC
Differential Revision: D60252821
Pulled By: HDCharles
fbshipit-source-id: 08ad452c5fcb34182c9aa7da1fe761db9587de71