Profiler: Do not record zero duration kernel events (#41540)
Summary:
Changes in the ROCm runtime have improved hipEventRecord. The events no longer take ~4 usec to execute on the GPU stream; instead, they appear instantaneous. If you record two events with no other activity in between, they will have the same timestamp and the elapsed duration will be 0.
The profiler uses hip/cuda event pairs to infer GPU execution times. It wraps functions whether or not they send work to the GPU. Functions that send no GPU work will show as having zero duration, and they will also show as running at the same time as neighboring functions. On a trace, all those functions combine into a 'call stack' that can be tens of functions tall (when in fact they should be sequential).
This patch suppresses recording the zero duration 'kernel' events, leaving only the CPU execution part. This means functions that do not use the GPU do not get an entry for how long they were using the GPU, which seems reasonable. This fixes the 'stacking' on traces. It also improves the signal-to-noise ratio of the GPU trace beyond what was available previously.
This patch will not affect CUDA or legacy ROCm, as those are not able to 'execute' eventRecord markers instantaneously.
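The suppression described above can be illustrated with a minimal Python sketch. This is not the code from the patch: the FunctionEvent type, its fields, and the kernel_records helper are hypothetical names made up for illustration, and the real profiler data structures may differ. It only shows the idea that a function whose event pair reports zero elapsed GPU time gets no 'kernel' entry, while its CPU entry is kept.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FunctionEvent:
    """Hypothetical stand-in for one profiled function call."""
    name: str
    cpu_time_us: float     # CPU-side execution time
    gpu_elapsed_us: float  # elapsed time between the start/stop event pair


def kernel_records(events: List[FunctionEvent]) -> List[Tuple[str, float]]:
    """Return (name, GPU duration) only for events that did GPU work.

    An elapsed time of 0 means the two event markers had the same
    timestamp, i.e. nothing ran on the GPU stream between them.
    """
    return [(e.name, e.gpu_elapsed_us) for e in events if e.gpu_elapsed_us > 0]


# Made-up sample data: one function with no GPU work, one with GPU work.
events = [
    FunctionEvent("aten::view", cpu_time_us=3.0, gpu_elapsed_us=0.0),
    FunctionEvent("aten::mm", cpu_time_us=10.0, gpu_elapsed_us=42.5),
]
gpu_trace = kernel_records(events)
```

With this sample data, only the second function ends up in gpu_trace; both functions would still have their CPU entries.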
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41540
Reviewed By: zou3519
Differential Revision: D22597207
Pulled By: albanD
fbshipit-source-id: 5e89de2b6d53888db4f9dbcb91a94478cde2f525