Simplify cross-device sync: use only cuEventSynchronize
The previous approach based on barrier events was corrupting CUDA state.
cuEventRecord is asynchronous, so the event could still be pending when
it was destroyed immediately after recording, which caused undefined
behavior.
Now use plain host-side synchronization:
- cuEventSynchronize blocks the calling CPU thread until the event
  completes
- Subsequent enqueues to the target stream are issued only after the
  event has completed
- No barrier events or additional synchronization are needed