[MLIR][GPU] Fix async.yield gpu.async.token lowering race (#190717)
This addresses the root cause of #170833 (flakiness of
`Integration/GPU/CUDA/async.mlir` on the Tesla T4 mlir-nvidia buildbot).
In `gpu-to-llvm`, two patterns matched `async.yield` with the same
benefit: the structural `ConvertYieldOpTypes` from
`populateAsyncStructuralTypeConversionsAndLegality` (which just retypes
operands), and `ConvertAsyncYieldToGpuRuntimeCallPattern` (which also
creates and records an event on the stream backing each
`gpu.async.token` operand). When the IR contained `gpu.launch_func`, the
dialect-conversion framework picked the structural pattern, silently
dropping the event record. The `async.execute` then yielded a stream
pointer where its consumers expected an event, and the host await ended
up calling `cuEventSynchronize` on a stream pointer. That call returns
an error without waiting, so the host raced against the GPU.
The fix registers `ConvertAsyncYieldToGpuRuntimeCallPattern` with pattern
benefit 2 so it wins on yields that carry `gpu.async.token` operands. The
structural rewriter still handles yields without token operands.
Also adds a new test, `lower-async-to-gpu-runtime-calls.mlir`, checking
the expected IR shape of `async.yield` after a `gpu.launch_func`.
Assisted-by: Claude
Fixes #170833