[Mosaic GPU] Profiler improvements
1. Each process in the trace now corresponds to an SM, showing how many
blocks are executing concurrently on it.
2. The timeline now accounts for the start offset of each block,
instead of aligning all blocks to a common start. This makes a lot more
sense in the SM view.
3. We now use inline PTX to emit profiler events. This sometimes slightly
pessimizes code generation, but lets us predicate out the write on
all threads other than the leader of each warpgroup, which improves
trace quality.
4. We make sure each trace is monotonic. I can't explain why, but the clocks
can behave erratically, potentially due to rescheduling at the SASS level.
We now fix up all backward movements and emit a warning if large shifts
are detected.
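
For illustration only, the monotonicity fix-up in item 4 could look roughly
like the sketch below. This is not the profiler's actual code: the helper name
`make_monotonic`, the `warn_threshold` parameter and its default, and the
policy of clamping a backward reading to the previous timestamp are all
assumptions made for the example.

```python
import warnings


def make_monotonic(timestamps, warn_threshold=1000):
    """Return a copy of `timestamps` in which time never moves backward.

    Hypothetical helper: the name, the threshold and the clamping policy
    are assumptions, not the profiler's implementation.
    """
    fixed = []
    max_shift = 0  # Largest backward jump seen so far, in clock ticks.
    last = None
    for t in timestamps:
        if last is not None and t < last:
            # The clock went backward: record the jump and clamp the reading.
            max_shift = max(max_shift, last - t)
            t = last
        fixed.append(t)
        last = t
    if max_shift > warn_threshold:
        warnings.warn(
            f"Trace clock moved backward by up to {max_shift} ticks; "
            "timestamps were clamped."
        )
    return fixed
```

Clamping keeps event ordering intact but can collapse an interval to zero
length, which is why a large correction is worth surfacing as a warning
rather than fixing silently.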
PiperOrigin-RevId: 659911268