Add CUDA plugin EP Sync support for IOBinding (#27919)
## Description
This change wires the CUDA plugin EP into ORT's sync surface (see
https://github.com/microsoft/onnxruntime/pull/27538) so `IOBinding` can
safely coordinate device work when inputs and outputs are bound on CUDA.
It also clarifies the split between EP-level and factory-level
sync-stream creation in the design doc and adds Python coverage to
validate the new path with simple CUDA-bound models.
## Summary of Changes
### CUDA plugin EP implementation
| File | Change |
|------|--------|
| `onnxruntime/core/providers/cuda/plugin/cuda_ep.cc` | Registers `OrtEp::CreateSyncStreamForDevice` and `OrtEp::Sync` in `CudaEp`, adds per-session CUDA sync-stream creation, and implements a conservative device-wide sync via `cudaSetDevice` + `cudaDeviceSynchronize()`. |
| `onnxruntime/core/providers/cuda/plugin/cuda_ep.h` | Declares the new `CreateSyncStreamForDeviceImpl` and `SyncImpl` entry points on `CudaEp`. |
### Tests
| File | Change |
|------|--------|
| `onnxruntime/test/python/transformers/test_cuda_plugin_ep.py` | Adds a helper to resolve the CUDA ordinal from plugin device metadata and adds `IOBinding`-based Add and MatMul tests that bind CUDA inputs/outputs and exercise the plugin EP sync path. |
### Documentation
| File | Change |
|------|--------|
| `docs/cuda_plugin_ep/cuda_plugin_ep_design.md` | Documents that `CudaEp` owns the preferred `OrtEp::CreateSyncStreamForDevice` and `OrtEp::Sync` implementations, while `CudaEpFactory::CreateSyncStreamForDevice` remains a fallback path; also records the new `IOBinding` test coverage. |
## Testing
- Set `ORT_CUDA_PLUGIN_PATH` to the rebuilt CUDA plugin library under
`build/cuda/Release` and run `python -m pytest
onnxruntime/test/python/transformers/test_cuda_plugin_ep.py`.
- Verify the new `IOBinding` Add and MatMul tests pass with CUDA-bound
`OrtValue` inputs and outputs.
- Confirm existing CUDA plugin EP behavior is unchanged for
non-`IOBinding` execution paths.
## Motivation and Context
`IOBinding` relies on provider synchronization to ensure asynchronous
device copies are complete before dependent kernel execution continues.
The CUDA plugin EP already supported sync-stream creation at the factory
layer; this change connects the per-session `OrtEp` callbacks
that ORT prefers when coordinating bound CUDA execution. The
documentation updates make that ownership model explicit so future
plugin work does not conflate the fallback factory hook with the primary
EP hook.
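
For reviewers less familiar with `IOBinding`, the flow the new tests exercise can be sketched roughly as below. This is an illustrative sketch, not the code in `test_cuda_plugin_ep.py`: the helper name `run_binary_op_with_iobinding` is hypothetical, the session is assumed to have been created elsewhere with the CUDA plugin EP selected, and it assumes the plugin's device can be addressed as `"cuda"` with the resolved ordinal.

```python
def run_binary_op_with_iobinding(session, a, b, device_id=0):
    """Hypothetical helper: bind two CUDA inputs and one CUDA output, run once.

    `session` is assumed to be an onnxruntime.InferenceSession for a
    two-input, one-output model (for example Add or MatMul).
    """
    # Imported lazily so the helper can be defined without ORT installed.
    import onnxruntime as ort

    binding = session.io_binding()
    output_name = session.get_outputs()[0].name

    # Copy each host array to the CUDA device and bind it by input name.
    for meta, array in zip(session.get_inputs(), (a, b)):
        value = ort.OrtValue.ortvalue_from_numpy(array, "cuda", device_id)
        binding.bind_ortvalue_input(meta.name, value)

    # Ask ORT to allocate the output on the same CUDA device.
    binding.bind_output(output_name, "cuda", device_id)

    # These synchronize calls are where provider-level sync is expected
    # to be requested around the bound run.
    binding.synchronize_inputs()
    session.run_with_iobinding(binding)
    binding.synchronize_outputs()

    # Bring the result back to the host for comparison.
    return binding.copy_outputs_to_cpu()[0]
```

A test would typically call this with two small NumPy arrays and compare the result against the NumPy reference for the same operation.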
## Checklist
- [x] Tests added/updated
- [x] Documentation updated (if applicable)
- [x] No breaking changes
- [ ] CI passes