Avoid waiting on a notification whose tensor is generated on the same device (#12778)
* avoid waiting on a notification whose tensor is generated on the same device
When a CPU kernel consumes the CPU output of a CUDA kernel, it does not need to wait on the CUDA notification.
* fix Linux break
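
The rule described above can be sketched roughly as follows. This is a minimal illustration, not the actual ONNX Runtime code: `DeviceKind`, `TensorInfo`, `Notification`, and `NeedsWait` are hypothetical names, and the sketch models only the rule stated in this message (real stream scheduling may take other conditions into account).

```cpp
// Hypothetical types for illustration only; not ONNX Runtime's actual API.
enum class DeviceKind { Cpu, Cuda };

struct TensorInfo {
  DeviceKind location;  // device where the tensor's memory lives
};

struct Notification {
  DeviceKind producer_device;  // device of the kernel that signals this notification
  TensorInfo output;           // tensor that the notification guards
};

// Returns true when the consumer should wait on the notification before
// reading the tensor. Assumption: if the tensor already lives on the
// consumer's own device, the wait is skipped, as described in this commit.
bool NeedsWait(const Notification& n, DeviceKind consumer_device) {
  return n.output.location != consumer_device;
}
```

Under these assumptions, a CPU kernel reading a CUDA kernel's CPU output skips the wait, while a CPU kernel reading a CUDA-resident output still waits.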
Co-authored-by: Cheng Tang <chenta@microsoft.com>