Add PerThreadContext for TRT EP (#16599)
Maintaining one execution context on a per thread basis is suggested per
TRT
[doc](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#threading)
to avoid synchronization issue.
For previous TRT EP, we did see synchronization issues when running
multithreading on some models, for example, FasterRCNN.
This PR leverages per thread context implementation from CUDA EP.
Followings are the modifications:
- Move CUDA graph and IExecutionContext objects to per thread context.
- Remove lock_gruad that previously placed for the whole compute_func()
and put lock_gruad in the blocks where multiple threads may update
kernel function state, access one builder, create/serialize/save engine,
save profile and serialize/save timing cache.
- On CentOS, don't unload TRT EP shared library and leave it around, so
that destructor of thread local data is still accessible upon thread
exits.
Note: Tested this PR with onnxruntime_perf_test and the overhead of
PerThreadContext is small.