[Mosaic GPU] Fix predicate for `tcgen05_mma` lowering.
From https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-mma-instructions-mma:
> The instruction `tcgen05.mma` has single thread semantics, unlike the collective instructions `mma.sync` or `wgmma.mma_async`. So, a single thread issuing the `tcgen05.mma` will result in the initiation of the whole matrix multiply and accumulate operation.
Issuing the instruction from a single thread is consistent with LANE lowering semantics.
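As a rough illustration of why the predicate matters, here is a toy Python model (not Mosaic GPU or PTX code; `WARP_SIZE`, `count_mma_issues`, and both predicates are hypothetical names invented for this sketch). It assumes the predicate is meant to select exactly one thread and counts how many threads would issue the instruction with and without it.

```python
# Toy model only: none of these names come from Mosaic GPU or PTX.
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def count_mma_issues(predicate, num_threads=WARP_SIZE):
    """Count the threads for which the predicate allows issuing the instruction."""
    return sum(1 for tid in range(num_threads) if predicate(tid))


def all_threads(tid):
    # No effective predicate: every thread in the warp issues the instruction.
    return True


def single_thread(tid):
    # Single-thread predicate: only one thread (here, thread 0) issues it.
    return tid == 0


issues_unpredicated = count_mma_issues(all_threads)
issues_single = count_mma_issues(single_thread)
```

Because `tcgen05.mma` has single thread semantics, each issuing thread initiates the whole operation, so in this model only the single-thread predicate yields exactly one initiation.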
PiperOrigin-RevId: 880831793