[Profiler] Add speedup estimate for FP32 pattern and Extra CUDA Copy Pattern (#81501)
Summary: The main idea is to run baseline benchmarks once event matching is done. This lets us measure the speed gain accurately on the current machine, since system performance varies from machine to machine.
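
A rough sketch of the idea, not the code in this PR: time a baseline and a candidate callable on the current machine and report the ratio. `estimate_speedup` and the two stand-in workloads are hypothetical names used for illustration only. The real patterns would time the matched ops (for example an FP32 matmul against its TF32 counterpart) rather than the pure-Python stand-ins used here.

```python
import timeit


def estimate_speedup(baseline, candidate, repeat=5, number=10):
    """Return baseline_time / candidate_time, measured on this machine.

    Hypothetical helper: takes the best of `repeat` runs for each callable
    to reduce the effect of scheduling noise.
    """
    base_time = min(timeit.repeat(baseline, repeat=repeat, number=number))
    cand_time = min(timeit.repeat(candidate, repeat=repeat, number=number))
    return base_time / cand_time


# Stand-in workloads (assumptions, not profiler ops): the "slow" path does
# far more work than the "fast" path, so the ratio should exceed 1.
def slow_path():
    return sum(range(200_000))


def fast_path():
    return sum(range(100))


speedup = estimate_speedup(slow_path, fast_path)
```

Because the ratio is measured locally instead of taken from a fixed table, the estimate reflects the hardware the profile was actually collected on.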
Test Plan: Manually tested on all the models in torchbench, and added a simple test in test_profiler.py.
Differential Revision: [D37894566](https://our.internmc.facebook.com/intern/diff/D37894566)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81501
Approved by: https://github.com/robieta