benchmarks/aggregate: skip first three data points (#6210)
Skipping the first iteration is a good idea, but it does not seem to be sufficient to get a reasonable stddev. For instance, yolov3 is quite noisy in the first few iterations, not just the first one:
- Inductor training numbers (in seconds): `79.075, 0.607, 0.115, 0.106, 0.100`
- PyTorch/XLA training numbers (in seconds): `99.074, 5.638, 5.871, 0.384, 0.367`
I believe this is because the first iteration compiles the model and the following two warm up the caches (it must be a large model, since Inductor takes ~80s to compile it).
In order to reduce stddev in the reports, skip the first three data points.
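
As a rough illustration of the effect (a minimal sketch, not the actual aggregation code), the snippet below computes the sample stddev of the yolov3 numbers quoted above with zero, one, and three leading data points dropped. The `skip_warmup` helper and its `skip` parameter are hypothetical names used only for this example.

```python
import statistics


def skip_warmup(times, skip=3):
    """Hypothetical helper: drop the first `skip` data points
    (compilation plus cache warm-up iterations)."""
    return times[skip:]


# yolov3 training times in seconds, copied from the numbers above.
runs = {
    "Inductor": [79.075, 0.607, 0.115, 0.106, 0.100],
    "PyTorch/XLA": [99.074, 5.638, 5.871, 0.384, 0.367],
}

for name, times in runs.items():
    for skip in (0, 1, 3):
        kept = skip_warmup(times, skip)
        # statistics.stdev needs at least two data points.
        print(f"{name}: skip={skip} stddev={statistics.stdev(kept):.3f}")
```

With only five data points per run, skipping three leaves two points for the stddev, so runs with fewer iterations would need a smaller `skip` or more repetitions.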