fix the PE training build. (#12762)
* fix on-device training build error
* fix execute_to_fetch; it can't be shared with partial execution
* fix a stupid bug
* rewrite the partial execution part
* reimplement partial execution support; stash more state into partial execution state
* fix a corner case when switching devices in ortmodule
* fix some bugs
* fix the terminate flag
Co-authored-by: Cheng Tang <chenta@microsoft.com>