Move device lock before the execution instead of tensor gathering (#3457)
* Move device lock before the execution instead of tensor gathering
* Handle OpByOp lock
* moved the barrier into RunPostOrder and changed the coll.indices.empty() condition
* added a conditional barrier to RunPostOrder to reduce the frequency of early barrier calls (WIP)
* moved TensorCollectionBarrier into TryRunCachedSync instead of calling it inside the if (async != nullptr) block in SyncTensorsGraphInternal
* moved the barrier call to ScheduleSyncTensorsGraph and optimized the barrier call in RunPostOrder
* nit change
* Empty-Commit
* fixing LTC lazy API change
* Empty-Commit
* Added profiling support for RunPostOrder. Added a race condition caveat comment.
* added a missing device filter to skip calling the barrier
* linter fix
* removed barrier_applied
* run test cleanup
* cleaner condition
* linter fix
* addressed feedback
* reverted tests
* updated toString API to new format
Co-authored-by: Milad Mohammadi <milad.mo@gmail.com>