Support to allow user to specify compute stream per session (#3723)
* Support to allow user to specify compute stream per session
Create computation cuda stream explicitly rather than use default legacy stream or per-thread default stream.
remove some redudant cudaStreamSynchronize
fix gpt2 model test failures
don't use default stream in nccl either.
add stream schronization in OnRunEnd()
using cub::DeviceScan::InclusiveSum which can be called with stream specified.
fix topK failure due to latest rebase
fix tensorrt
support user specified stream
add user_stream support in tensorrt EP
use same stream for both tensort and CUDA EP.
fix ScatterND
specify stream for adasum and p2p kernels.
fix loop
fix CApiTest.custom_op_handler
fix CApiTest.varied_input_custom_op_handler
change for cudaMemcpyFromSymbol
improve provider options for user specified compute stream
* add changes for ROCM EP
* fix GatherGrad UT for ROCM EP
* clean code and fix NonMaxSuppression
* use default stream for ROCM now
* fix CApiTest.custom_op_handler:OrtFormatCustomOpTests.ConvertOnnxModelToOrt
* fix tensorrt ut: CApiTest.io_binding_cuda
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>