[CUDA EP] remove per-thread allocator (#5415)
Now that we use the legacy default stream, which is shared among all inference threads,
a per-thread allocator is no longer needed.
In the past, a race could occur when two threads ran concurrently on the GPU:
thread1: allocA->copyA->computeA->freeA
thread2: allocB->copyB->computeB->freeB
Note that freeA/B only means the buffer is available for reallocation on the CPU side, while the
corresponding GPU operations may not have finished yet. Thread1 and thread2 can therefore end up
using the same buffer when their alloc/free pairs are not interleaved (alloc/free itself is thread-safe).
If the GPU commands run in separate per-thread default streams, copyA/computeA may be interleaved
with copyB/computeB on the GPU even when the CPU execution order is not interleaved. This would
produce incorrect results if computeB reads the shared buffer after copyA has written to it.
With a single legacy default stream, the GPU execution order matches the CPU execution order, so
if A and B get the same buffer from alloc, the corresponding copy/compute operations cannot be
interleaved. If the copy/compute operations are in fact interleaved, then allocA and allocB must
have returned different buffers, so there is no race in that case either.
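
The ordering argument can be illustrated with a small CPU-only model. This is a sketch, not code
from the CUDA EP: `run`, `single_stream`, and `per_thread` are hypothetical names, and the model
assumes both threads were handed the same buffer by the allocator.

```python
def run(gpu_order):
    """Replay GPU operations, in the given order, against one shared buffer."""
    buffer = None   # contents of the buffer that both allocA and allocB returned
    seen = {}       # which thread's data each compute actually read
    for kind, tag in gpu_order:
        if kind == "copy":
            buffer = tag        # copyX writes thread X's input into the buffer
        elif kind == "compute":
            seen[tag] = buffer  # computeX reads whatever the buffer holds now
    return seen


# One legacy default stream: the GPU order equals the CPU enqueue order.
single_stream = [("copy", "A"), ("compute", "A"), ("copy", "B"), ("compute", "B")]

# Separate per-thread streams: each stream keeps its own order (copy before
# compute), but the GPU is free to interleave the two streams.
per_thread = [("copy", "B"), ("copy", "A"), ("compute", "A"), ("compute", "B")]

print(run(single_stream))  # {'A': 'A', 'B': 'B'}
print(run(per_thread))     # {'A': 'A', 'B': 'A'}
```

In the second ordering computeB reads the data written by copyA, which is the incorrect result
described above.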