[CUDA] support user_compute_stream in python API (#19229)
### Description
It is an important feature to pass user cuda stream to avoid
synchronization in python API. Here we allow user to pass cuda stream
for CUDA provider. Note that TRT or ROCm provider need similar change,
which are not included in this pull request.
Note that we will set `has_user_compute_stream` automatically based on
whether there is cuda stream passed, so setting
`has_user_compute_stream` through python API has no effect.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
https://github.com/microsoft/onnxruntime/issues/19094