Ensure the correct GPU device is used in RunSince when it's invoked by a new thread (#24192)
Running a CUDA kernel on the wrong GPU device fails with the CUDA
error `invalid resource handle`.
Both the CUDA EP and the TRT EP have this issue when `ExecutionMode::ORT_PARALLEL`
is enabled.
Repro code:
````python
import threading

import onnxruntime as ort

# Sessions alternate between GPU 0 and GPU 1.
provider = [
    [
        ('TensorrtExecutionProvider', {
            'device_id': 0,
        }),
    ],
    [
        ('TensorrtExecutionProvider', {
            'device_id': 1,
        }),
    ],
]

class ThreadObj():
    def __init__(self, model_path: str, iterations: int, idx: int):
        ...
        sess_opt = ort.SessionOptions()
        sess_opt.execution_mode = ort.ExecutionMode.ORT_PARALLEL
        self.inference_session = ort.InferenceSession(model_path, sess_opt, provider[idx % 2])

    def warmup(self):
        self.inference_session.run(None, self.input)

    def run(self, thread_times, threads_complete):
        for iter in range(self.iterations):
            self.inference_session.run(None, self.input)

def thread_target(obj, thread_times, threads_complete):
    obj.run(thread_times, threads_complete)

...

iterations = 500
num_threads = 13
t_obj_list = []
thread_list = []

for tidx in range(num_threads):
    obj = ThreadObj(model_path, iterations, tidx)
    t_obj_list.append(obj)
    obj.warmup()

# Each session's Run is driven from a freshly created thread.
for t_obj in t_obj_list:
    thread = threading.Thread(target=thread_target, daemon=True, args=(t_obj, thread_times, threads_complete,))
    thread.start()
    thread_list.append(thread)

...
````
The root cause: when an inference session is initialized, it can be bound
to a device id > 0, but at inference time `RunSince` can be invoked by a
newly created thread, and new threads default to device 0. Kernels then
launch on the wrong GPU device and fail with the error above.
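The device-defaulting behavior comes from the CUDA runtime itself: the current device is per-thread state. A standalone snippet (not ORT code) illustrating this on a machine with at least two GPUs:
````cpp
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

int main() {
  int count = 0;
  cudaGetDeviceCount(&count);
  if (count < 2) {
    std::printf("needs at least 2 GPUs\n");
    return 0;
  }

  cudaSetDevice(1);  // bind the main thread to device 1
  int device = -1;
  cudaGetDevice(&device);
  std::printf("main thread device: %d\n", device);  // prints 1

  std::thread t([] {
    // A freshly created thread has its own current-device state,
    // which starts at the default device 0.
    int d = -1;
    cudaGetDevice(&d);
    std::printf("new thread device: %d\n", d);  // prints 0
  });
  t.join();
  return 0;
}
````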
This PR provides a general fix for both the CUDA EP and the TRT EP: call
`cudaSetDevice` in `RunSince` so the invoking thread is bound to the
session's device before any kernel is launched.
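A minimal sketch of the shape of the fix, assuming a hypothetical helper that `RunSince` would call at its entry (the actual hook and member names in the EPs differ):
````cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper standing in for what RunSince does on entry;
// `device_id` is the device the session/EP was created with.
bool BindThreadToSessionDevice(int device_id) {
  // The current device is per-thread, so the thread driving this Run may
  // still be on the default device 0. Setting the same device again is
  // harmless, so this is safe to do on every Run.
  cudaError_t err = cudaSetDevice(device_id);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaSetDevice(%d) failed: %s\n",
                 device_id, cudaGetErrorString(err));
    return false;
  }
  return true;
}
````
Because the binding is thread-local, doing it once at session creation is not enough; it has to happen on whichever thread actually drives the run.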