Use Timer for CUDA benchmarks
`torch.cuda.synchronize()` is a heavy hammer and distorts benchmarking results considerably. Timer produces results that are closer to the kernel times observed in the profiler.
If you prefer, you can use `timeit` instead of `blocked_autorange`; `timeit` repeats the stmt a fixed number of times.
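
For illustration, here is a minimal sketch of both measurement modes. It assumes that Timer refers to `torch.utils.benchmark.Timer`; the tensor size, the `torch.mm` statement, and the variable names are hypothetical and are not taken from the benchmarks changed in this PR.

```python
import torch
from torch.utils.benchmark import Timer

# Fall back to CPU so the sketch still runs on a machine without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(256, 256, device=device)  # hypothetical workload

# The statement is passed as a string; names it uses come from `globals`.
timer = Timer(stmt="torch.mm(x, x)", globals={"x": x, "torch": torch})

# Adaptive mode: keeps running blocks of the stmt until at least
# `min_run_time` seconds of measurements have been collected.
m_auto = timer.blocked_autorange(min_run_time=0.2)

# Fixed mode: runs the stmt exactly 100 times.
m_fixed = timer.timeit(100)

# Both calls return a Measurement; times are per-run, in seconds.
print(f"blocked_autorange median: {m_auto.median * 1e6:.1f} us")
print(f"timeit(100) mean:         {m_fixed.mean * 1e6:.1f} us")
```

`blocked_autorange` is usually the more convenient choice when the cost of the stmt is not known in advance, while `timeit` gives a predictable run count.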
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75393
Approved by: https://github.com/davidberard98