cuda: Track USM allocation metadata for cross-device operations

Problem:
In multi-device contexts, each device has its own primary CUDA context.
When USM memory allocated on device A is accessed from a queue on device B,
cuMemcpyAsync fails because the stream belongs to context B but the
memory belongs to context A.

Root cause:
- urUSMSharedAlloc/urUSMDeviceAlloc allocate memory in device-specific contexts
- urEnqueueUSMMemcpy receives pointers without knowing their origin device
- Cross-context copies require cuMemcpyPeerAsync, which takes the source
  and destination contexts explicitly

Solution:
Track allocation metadata in ur_context to record which device allocated
each USM pointer. In urEnqueueUSMMemcpy, query this metadata to detect
cross-device copies and use cuMemcpyPeerAsync with explicit source and
destination contexts.
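
The decision described above can be sketched as follows. This is a
minimal illustration, not the adapter's actual code: DeviceId, Untracked,
CopyPlan and planCopy are hypothetical names, devices are modelled as
plain integers, and the rule "treat an untracked pointer as belonging to
the queue's device" is an assumption about how unregistered (for example
host) pointers might be handled.

```cpp
#include <cassert>

// Hypothetical stand-in for a device handle; a real adapter would carry a
// ur_device_handle_t or the device's CUcontext instead of an integer.
using DeviceId = int;

// Assumed sentinel for pointers that have no registered metadata.
constexpr DeviceId Untracked = -1;

// Result of the decision: which copy path to take and which devices'
// contexts would be passed as the source and destination contexts.
struct CopyPlan {
  bool UsePeerCopy;   // true: peer copy path, false: same-context path
  DeviceId SrcDevice; // device assumed to own the source pointer
  DeviceId DstDevice; // device assumed to own the destination pointer
};

// Decide the copy path for a queue on QueueDevice, given the owners
// recorded for the source and destination pointers.
inline CopyPlan planCopy(DeviceId QueueDevice, DeviceId SrcOwner,
                         DeviceId DstOwner) {
  assert(QueueDevice >= 0 && "the queue must belong to a real device");

  // Assumption: an untracked pointer is treated as local to the queue.
  const DeviceId Src = (SrcOwner == Untracked) ? QueueDevice : SrcOwner;
  const DeviceId Dst = (DstOwner == Untracked) ? QueueDevice : DstOwner;

  // The copy is cross-device if either side lives outside the context
  // that the queue's stream belongs to.
  const bool CrossDevice = (Src != QueueDevice) || (Dst != QueueDevice);
  return CopyPlan{CrossDevice, Src, Dst};
}
```

With such a plan, the enqueue path would call cuMemcpyPeerAsync (which
takes the destination pointer and context, the source pointer and
context, the byte count and the stream) when UsePeerCopy is set, and
cuMemcpyAsync otherwise.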

Changes:
- Add AllocationMetadata map to ur_context_handle_t with thread-safe access
- Register allocations in urUSMDeviceAlloc and urUSMSharedAlloc
- Unregister in urUSMFree
- Query metadata in urEnqueueUSMMemcpy to detect cross-device copies
- Use cuMemcpyPeerAsync for cross-device, cuMemcpyAsync otherwise
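
For illustration, the register/unregister/query cycle listed above could
look roughly like the sketch below. It is not the code in this change:
AllocationRegistry, AllocationInfo and DeviceId are hypothetical names,
the device is modelled as an integer, and the choice of std::mutex with
std::unordered_map keyed by the allocation's base pointer is an
assumption. Under that assumption a lookup is constant time on average,
but only exact base pointers are found; matching interior pointers would
need a range lookup instead.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-in for a device handle (a real adapter would store
// a ur_device_handle_t or similar).
using DeviceId = int;

struct AllocationInfo {
  DeviceId Device;  // device whose context owns the allocation
  std::size_t Size; // allocation size in bytes
};

// Thread-safe map from USM base pointer to allocation metadata.
class AllocationRegistry {
public:
  // Intended to be called after a successful USM allocation.
  void registerAllocation(const void *Ptr, DeviceId Device,
                          std::size_t Size) {
    assert(Ptr != nullptr && "cannot track a null allocation");
    std::lock_guard<std::mutex> Lock(Mutex);
    Allocations[Ptr] = AllocationInfo{Device, Size};
  }

  // Intended to be called when the pointer is freed. Returns false if
  // the pointer was not tracked.
  bool unregisterAllocation(const void *Ptr) {
    std::lock_guard<std::mutex> Lock(Mutex);
    return Allocations.erase(Ptr) == 1;
  }

  // Copies the metadata into Out and returns true if Ptr is a tracked
  // base pointer; returns false and leaves Out untouched otherwise.
  bool lookup(const void *Ptr, AllocationInfo &Out) const {
    std::lock_guard<std::mutex> Lock(Mutex);
    auto It = Allocations.find(Ptr);
    if (It == Allocations.end())
      return false;
    Out = It->second;
    return true;
  }

private:
  mutable std::mutex Mutex; // guards Allocations
  std::unordered_map<const void *, AllocationInfo> Allocations;
};
```

In this sketch the allocation entry points would call
registerAllocation, the free path would call unregisterAllocation, and
the memcpy path would call lookup for both the source and destination
pointers before choosing a copy function.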

Looking up a pointer's origin device is O(1), so cross-context copies
are detected up front and routed to the correct call rather than
discovered by trial and error.