Use cuMemcpyPeerAsync for cross-device USM copies in CUDA adapter
The CUDA adapter used cuMemcpyAsync() for all USM copies,
including cross-device copies. However, CUDA requires cuMemcpyPeerAsync()
for peer-to-peer copies between different devices, even when P2P access
is enabled via cuCtxEnablePeerAccess().
This change:
- Detects cross-device copies by querying CU_POINTER_ATTRIBUTE_CONTEXT
  for both source and destination pointers
- Uses cuMemcpyPeerAsync() when contexts differ (cross-device copy)
- Falls back to cuMemcpyAsync() for same-device or host-device copies
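
The dispatch rule above can be sketched roughly as follows. This is an illustration of the decision only, not the adapter code: `Context`, `CopyPath`, and `chooseCopyPath` are hypothetical names, `Context` stands in for `CUcontext` so the snippet builds without the CUDA headers, and treating a missing context as "use the plain copy" is an assumption.

```cpp
#include <cassert>

// Hypothetical stand-in for CUcontext so this sketch compiles without cuda.h.
using Context = const void *;

// Which driver call the copy would be routed to.
enum class CopyPath {
  Plain, // cuMemcpyAsync(dst, src, bytes, stream)
  Peer   // cuMemcpyPeerAsync(dst, dstCtx, src, srcCtx, bytes, stream)
};

// srcCtx / dstCtx are the values obtained by querying
// CU_POINTER_ATTRIBUTE_CONTEXT for each pointer. A nullptr here models a
// pointer with no known context (assumption: the query failed or returned
// nothing, e.g. for memory the driver does not track).
inline CopyPath chooseCopyPath(Context srcCtx, Context dstCtx) {
  const bool bothKnown = srcCtx != nullptr && dstCtx != nullptr;
  // Only two distinct, known contexts count as a cross-device copy.
  if (bothKnown && srcCtx != dstCtx)
    return CopyPath::Peer;
  // Same context, or at least one side unknown: keep the plain copy.
  return CopyPath::Plain;
}
```

In the adapter itself the two contexts would come from cuPointerGetAttribute(), and the result would select between the two driver calls named in the enum comments.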
This fixes urEnqueueKernelLaunchIncrementMultiDeviceTest, which
chains kernel launches and cross-device memcpy operations.
Fixes: #19033