[CUDA] Fix Managed Memory cross-device copy to work without P2P

NVIDIA A2 GPUs (GA107 chip) lack P2P support. For cross-device copies:
- Prefetch both SRC and DST Managed Memory to the CPU before the copy
- Let the CUDA driver stage the copy through the host: GPU0→CPU→GPU1
- Detect cross-device copies using CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL
- Fix an unused variable warning
This lets multi-GPU tests run on entry-level datacenter GPUs without
NVLink/P2P, at the cost of reduced performance from host staging.