[Mosaic GPU] Fix a race in test_remote_async_copy

Previously, the kernel in the test would exit without observing the
completion of the current GPU's and the peer GPU's writes to each
other's memory, which occasionally caused the correctness checks to
fail.
We fix the race by adding a system memory barrier to observe the
completion of this GPU's write to the peer GPU's memory, and a semaphore
to observe the completion of the peer GPU's write to this GPU's memory.
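
For illustration only, the handshake can be sketched as a host-side
analogy in plain Python, with two threads standing in for the two GPUs.
This is not the Mosaic GPU code from the test: the names `kernel`,
`buffers`, `arrived` and `run` are hypothetical, and the system memory
barrier has no direct counterpart here because `threading.Semaphore`
already orders the write before the signal.

```python
import threading

NUM_DEVICES = 2

# buffers[i] stands in for "device i's memory"; the peer writes into it.
buffers = [None] * NUM_DEVICES
# arrived[i] is signalled once device i's buffer has been written by its peer.
arrived = [threading.Semaphore(0) for _ in range(NUM_DEVICES)]
# What each device saw in its own buffer just before exiting.
results = [None] * NUM_DEVICES


def kernel(dev: int) -> None:
    peer = 1 - dev
    # Remote write into the peer's memory.
    buffers[peer] = f"payload from {dev}"
    # Signal the peer that its buffer is ready. On real hardware the write
    # must first be made visible system-wide (the barrier in this commit);
    # in this analogy the semaphore release provides that ordering.
    arrived[peer].release()
    # Do not exit until the peer's write into our own memory has landed.
    arrived[dev].acquire()
    results[dev] = buffers[dev]


def run() -> list:
    threads = [
        threading.Thread(target=kernel, args=(dev,))
        for dev in range(NUM_DEVICES)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Without the `acquire()` at the end of `kernel`, a thread could read its
own buffer (or exit) before the peer has written it, which is the
analogue of the race described above.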