[CUDA] Use cuMemPrefetchAsync for Managed Memory in urEnqueueUSMMemcpy
For CUDA Managed Memory (CU_MEMORYTYPE_UNIFIED), use prefetch hints
instead of relying solely on automatic migration:
1. Prefetch the destination range to the queue's device with
   cuMemPrefetchAsync before the copy
2. Perform the copy with cuMemcpyAsync
3. A subsequent kernel access from another device triggers migration
   as usual
Also handle cross-device copies between Device memory allocations with
cuMemcpyPeerAsync.
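
The dispatch described above can be sketched roughly as follows. This is a
minimal model of the decision only, not the adapter's actual code: MemKind,
Allocation, CopyStep, Plan and choosePlan are hypothetical names introduced
for illustration, and the CUDA driver calls appear only in comments so the
sketch builds without the CUDA toolkit. A real implementation would obtain
the memory type and owning device via cuPointerGetAttribute rather than
taking them as arguments.

```cpp
#include <cassert>

// Hypothetical stand-ins for the pointer attributes a real adapter would
// query through cuPointerGetAttribute.
enum class MemKind { Host, Device, Managed };

struct Allocation {
  MemKind kind;
  int device; // ordinal of the owning device; ignored for Host
};

// Which driver calls the copy would issue (assumed naming).
enum class CopyStep {
  PlainCopy,        // cuMemcpyAsync only
  PrefetchThenCopy, // cuMemPrefetchAsync on the destination, then cuMemcpyAsync
  PeerCopy          // cuMemcpyPeerAsync between the two device contexts
};

struct Plan {
  CopyStep step;
  int prefetchDevice; // target of the prefetch hint, or -1 if none
};

// Pick a copy strategy for a memcpy enqueued on a queue bound to queueDevice.
Plan choosePlan(const Allocation &dst, const Allocation &src, int queueDevice) {
  assert(queueDevice >= 0 && "queue must be bound to a valid device");

  // Managed destination: hint the pages toward the queue's device first,
  // then issue the ordinary asynchronous copy.
  if (dst.kind == MemKind::Managed)
    return {CopyStep::PrefetchThenCopy, queueDevice};

  // Device-to-device across different devices: use the peer copy path.
  if (dst.kind == MemKind::Device && src.kind == MemKind::Device &&
      dst.device != src.device)
    return {CopyStep::PeerCopy, -1};

  // Everything else falls back to a plain asynchronous copy.
  return {CopyStep::PlainCopy, -1};
}
```

For example, a Managed destination copied on a queue bound to device 1
yields PrefetchThenCopy with prefetchDevice 1, while a copy between Device
allocations owned by devices 0 and 1 yields PeerCopy. Note that the
parameter list of cuMemPrefetchAsync differs between CUDA toolkit versions,
so the real call site should be checked against the toolkit in use.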