[CUDA] Use USM Device memory for multi-GPU tests instead of USM Shared

CUDA Managed Memory (USM Shared) does not support explicit cross-device
copies between separate per-device allocations. NVIDIA's documentation
describes Managed Memory as a single shared buffer that is migrated
automatically.
For multi-GPU tests on CUDA, use USM Device memory, which supports
cudaMemcpyPeer for peer-to-peer transfers, as documented in the CUDA
Programming Guide, section 3.4.2.1.