[CUDA] Use different strategies for Managed vs Device memory cross-device copies
For cross-device USM memcpy operations:
- Check memory type using CU_POINTER_ATTRIBUTE_MEMORY_TYPE
- For Managed Memory (CU_MEMORYTYPE_UNIFIED, i.e. USM Shared): use
  cuMemcpyAsync and let the CUDA driver handle page migration automatically
- For Device Memory (CU_MEMORYTYPE_DEVICE, i.e. USM Device): use
  cuMemcpyPeerAsync with explicit source and destination contexts
This lets CUDA's Unified Memory subsystem migrate Managed Memory pages on
demand, while Device Memory is still copied with explicit peer-to-peer copies.
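
The selection logic described above can be sketched roughly as follows. This is an illustration only, not the code in this change: MemType, CopyStrategy and selectCrossDeviceCopy are hypothetical names, the enum is a stand-in for the driver's CUmemorytype so that the snippet builds without cuda.h, and routing every type other than Device to cuMemcpyAsync is an assumption (the message only covers the Managed and Device cases).

```cpp
#include <cassert>

// Stand-in for the CUDA driver's CUmemorytype values. Real code would include
// <cuda.h> and obtain the type with cuPointerGetAttribute and
// CU_POINTER_ATTRIBUTE_MEMORY_TYPE.
enum class MemType { Host, Device, Array, Unified };

// Which driver entry point a cross-device copy would be routed to.
enum class CopyStrategy {
  MemcpyAsync,     // cuMemcpyAsync: the driver migrates managed pages itself
  MemcpyPeerAsync  // cuMemcpyPeerAsync: explicit source/destination contexts
};

// Hypothetical helper: choose the copy strategy from the queried memory type.
// Assumption: anything that is not plain device memory takes the
// cuMemcpyAsync path.
constexpr CopyStrategy selectCrossDeviceCopy(MemType type) {
  return type == MemType::Device ? CopyStrategy::MemcpyPeerAsync
                                 : CopyStrategy::MemcpyAsync;
}

// Small self-check covering the two cases named in the message.
inline void checkStrategySelection() {
  assert(selectCrossDeviceCopy(MemType::Unified) == CopyStrategy::MemcpyAsync);
  assert(selectCrossDeviceCopy(MemType::Device) == CopyStrategy::MemcpyPeerAsync);
}
```

Keeping the decision in a small pure function like this makes it testable without a GPU; the actual driver calls would then be dispatched on its result.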