Only flush each stream once in CopyInputAcrossDevice() (#17303)
### Description
In the CopyInputAcrossDevice() function, each feed is assigned a stream
for the cross-device copy. Once the copies are done, Flush() is triggered
for every feed's stream, so a stream shared by several feeds is flushed
repeatedly, which is undesirable. Each distinct stream should be flushed
only once.
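
A minimal sketch of the flush-once idea, not the actual patch: the `Stream` struct and the `FlushUniqueStreams` helper below are hypothetical stand-ins introduced for illustration and are not the real ONNX Runtime types. It assumes the feeds' streams are available as a list of pointers, some of which may repeat or be null.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a device stream; not the real ONNX Runtime class.
struct Stream {
  int flush_count = 0;
  void Flush() { ++flush_count; }
};

// Hypothetical helper: flushes each distinct, non-null stream exactly once,
// even when several feeds were assigned the same stream.
// Returns the number of Flush() calls made.
std::size_t FlushUniqueStreams(const std::vector<Stream*>& feed_streams) {
  std::unordered_set<Stream*> visited;
  std::size_t flushes = 0;
  for (Stream* stream : feed_streams) {
    // insert().second is true only the first time a given pointer is seen.
    if (stream != nullptr && visited.insert(stream).second) {
      stream->Flush();
      ++flushes;
    }
  }
  return flushes;
}
```

With many feeds mapped to a handful of streams, the number of flushes then scales with the number of distinct streams rather than with the number of feeds.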
### Motivation and Context
This change addresses a performance issue in TLNGv4 inference, which
contains a subgraph with many input feeds.