Switch to non-blocking stream (#12826)
* switch to non-blocking stream
* fix the wrong cuBLAS handle used in scan / transpose
* fix training build
Co-authored-by: Cheng Tang <chenta@microsoft.com>