[Mosaic GPU] Second attempt at fixing `test_tcgen05_collective_mma`.
We need to make sure operands in SMEM have been loaded on both blocks before issuing the MMA instruction.
We also pass `orders_tensor_core=True` to the cluster barrier arrive/wait calls so that the barrier synchronization also orders TensorCore operations.
PiperOrigin-RevId: 796837258