Use CUDA graphs for benchmarking
Summary:
Per https://fb.workplace.com/groups/420659799592399/posts/807860500872325/, benchmarking with CUDA graphs is considerably more accurate than regular (non-CUDA-graph) benchmarking.
I had to change several use sites of `metrics.latency` because `do_bench_cudagraph` does not support returning quantiles. This could be fixed upstream, but that would take more time, and quantiles do not seem particularly useful in TritonBench anyway.
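For illustration only, here is a rough sketch of the kind of use-site change involved. It assumes the old latency value was a sequence of quantiles with the median first and that the new value is a single float. The helper name `latency_as_scalar` is hypothetical and is not part of TritonBench or Triton.

```python
from typing import Sequence, Union

# Assumed shapes: a bare float (CUDA-graph path) or a sequence of
# quantiles with the median first (old path). Both are assumptions.
Latency = Union[float, Sequence[float]]


def latency_as_scalar(latency: Latency) -> float:
    """Hypothetical helper: reduce either latency shape to one float."""
    if isinstance(latency, (int, float)):
        # New shape: already a single number.
        return float(latency)
    # Old shape (assumed): keep only the first entry, taken to be the median.
    return float(latency[0])


print(latency_as_scalar(1.25))
print(latency_as_scalar([1.25, 1.10, 1.40]))
```

With a helper like this, a use site that previously indexed into the quantiles could read one number regardless of which benchmarking path produced it.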
Reviewed By: xuzhao9, sijiac
Differential Revision: D58502780
fbshipit-source-id: 8c97b95097f49ece47ce9b1660af60afae8c25e8