use legacy stream mode (#2076)
In ORT, there is only 3 cuda stream: default, HtoD, DtoH. And both HtoD and DtoH are non-blocking stream. Thus, per-thread stream mode doesn't have any benefit.
I also tried in multiple thread env and the legacy mode is also better than per-thread model.
Below is the perf of a 3 layer bert on v100. Unit is ms:
batch size 1:
concurrency | c=1 | c=2 | c=4
legacy | 0.54 | 1.17 | 2.68
per-thread | 0.66 | 1.37 | 2.86
batch size 4:
concurrency | c=1 | c=2 | c=4
legacy | 1.1 | 2.22 | 4.6
per-thread | 1.21 | 2.44 | 4.98
batch size 64:
concurrency | c=1 | c=2 | c=4
legacy | 8.09 | 16.13 | 32.37
per-thread | 8.18 | 16.26 | 32.45