[Mosaic GPU] Make the profiler warpgroup aware
Instead of creating one timeline per block, we now create one
timeline per warpgroup. This is especially useful when warpgroups
differ in their execution traces.
Also, instead of taking the total capacity, the profiler now accepts
the number of entries per block. This makes it easier to find a good
size.
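For illustration only, a minimal sketch of the sizing arithmetic this
change implies. The function names are hypothetical and are not part of
the Mosaic GPU API. It assumes a warpgroup is 128 threads (4 warps of
32) and that the total capacity is the per-block budget times the number
of blocks in the grid.

```python
import math

WARPGROUP_SIZE = 128  # assumption: 4 warps x 32 threads per warpgroup


def total_entries(grid, entries_per_block):
    """Hypothetical helper: total buffer entries for a launch grid."""
    if entries_per_block <= 0:
        raise ValueError("entries_per_block must be positive")
    # Every block gets the same budget, so the total scales with the grid.
    return math.prod(grid) * entries_per_block


def timeline_count(grid, threads_per_block):
    """Hypothetical helper: number of timelines, one per warpgroup."""
    if threads_per_block % WARPGROUP_SIZE != 0:
        raise ValueError("block size must be a multiple of the warpgroup size")
    warpgroups_per_block = threads_per_block // WARPGROUP_SIZE
    return math.prod(grid) * warpgroups_per_block
```

With a per-block budget, the value stays the same when the grid changes;
only the derived total grows.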
PiperOrigin-RevId: 640121060