[CUDA][UR] Use cuMemcpyPeer for cross-device USM copies

The previous approach, a synchronous cuMemcpy, failed because cuMemcpy
does not correctly handle cross-device copies, even in synchronous mode.

cuMemcpyPeer explicitly takes source and destination contexts as
parameters and is designed for peer-to-peer copies between different
device contexts. This works for both USM Device and USM Shared memory.
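
The stream is synchronized before calling cuMemcpyPeer because:
1. cuMemcpyPeer is synchronous (it blocks until the copy completes) and
   takes no stream argument, so it is not ordered after work already
   queued on the stream.
2. All pending operations in the stream must finish before the copy
   reads the source or writes the destination.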
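
The ordering described above can be sketched as follows. This is a minimal illustration, not the adapter's actual code: crossDeviceCopy is a hypothetical helper, and the two driver calls are passed in as callables so the sketch compiles without the CUDA headers. In the adapter they would correspond to cuStreamSynchronize(stream) and cuMemcpyPeer(dst, dstCtx, src, srcCtx, bytes).

```cpp
#include <cstddef>
#include <functional>

// Hypothetical helper, not the adapter's actual code. Each callable returns 0
// on success, mirroring CUDA_SUCCESS (which is 0) in the driver API.
//   syncStream : would wrap cuStreamSynchronize(stream)
//   peerCopy   : would wrap cuMemcpyPeer(dst, dstCtx, src, srcCtx, bytes)
inline int crossDeviceCopy(const std::function<int()> &syncStream,
                           const std::function<int(std::size_t)> &peerCopy,
                           std::size_t bytes) {
  // Drain the stream first: cuMemcpyPeer takes no stream argument, so it is
  // not ordered after work already queued on the stream.
  if (int err = syncStream())
    return err; // skip the copy if the pending work failed

  // Blocking peer-to-peer copy between the source and destination contexts.
  return peerCopy(bytes);
}
```

If the synchronization fails, the sketch returns that error and never issues the copy, so the destination is not written with data from an incomplete producer.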