Run performance test non-alternately
Summary:
By default, performance tests (speedup experiments) will run the baseline and test backend alternately.
However, this does not work for the torchao backend, which will change the model in-place, therefore the baseline run will also run with torchao backend since the model has already been quantized.
Add a new experiment "latency_experiment" to run performance tests non-alternately (first run baseline for a few iterations, then run the test backend).
other changes:
need to add torch.compiler.cudagraph_mark_step_begin() to avoid the
slowdown from # Unable to hit fast path of CUDAGraphs because of pending, uninvoked backwards
also updated the torchao APIs to the current versions
Test Plan:
python run_benchmark.py torchao --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
python run_benchmark.py torchao --only BartForCausalLM --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
python run_benchmark.py torchao --only timm_efficientnet --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
(should all be ~1.0
0.997x
1.006x
0.994x
Reviewers:
Subscribers:
Tasks:
Tags: