Continuous batching thread safety (#44924)
* fix: `torch.cuda.graph` should capture in `thread_local` error mode
* fix: the `tie_weights` skipping logic is not thread-safe
* doc
* cleanup
* revert the `tie_weights()` concurrency bug fix; it will be moved to another PR
* clean up the unit test to only check for the `thread_local` error mode
* add a real-model test
* remove the error-mode-setting unit test
* remove the unit test
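For context, a minimal sketch of the capture-mode change (not the PR's actual diff; `capture_decode_step` is a hypothetical helper). With the default `capture_error_mode="global"`, CUDA work issued by *other* threads during capture is treated as a stream-capture violation, which breaks continuous batching where multiple threads run concurrently; `"thread_local"` scopes the capture checks to the capturing thread only:

```python
# Hypothetical sketch of capturing a decode step with a thread-local
# CUDA graph capture mode. Assumes a PyTorch version that supports the
# capture_error_mode argument of torch.cuda.graph.

try:
    import torch
    HAVE_CUDA = torch.cuda.is_available()
except ImportError:  # allow the sketch to be read without torch installed
    HAVE_CUDA = False


def capture_decode_step(model_fn, static_input):
    """Capture model_fn(static_input) into a CUDA graph.

    capture_error_mode="thread_local" restricts capture-violation checks
    to this thread, so sibling batching threads issuing their own CUDA
    work are not flagged (the "global" default would invalidate them).
    """
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g, capture_error_mode="thread_local"):
        static_output = model_fn(static_input)
    return g, static_output


if not HAVE_CUDA:
    print("CUDA unavailable; skipping graph capture demo")
```

After capture, replaying is the usual pattern: copy new data into `static_input`, call `g.replay()`, and read results from `static_output`.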