Chenta/avoid useless notification (#12835)
* avoid creating a notification for CUDA tensors in CPU memory
* refactor the notification/wait generation code
* only skip notification for shape
* fix a bug that caused the wrong wait handle to be used
Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>