[TGIF][Inplace][Perf] Copy tensor to device with pinned memory & move copy weight sleep to getRecord (#106849)
Summary:
There are 2 changes in the diff that helps optimize perf during inplace update:
1. Read data with pinned memory
2. move the copy weight sleep from between copying the whole Tensor to between copying chunks
Test Plan:
**Local Test**
```
./ai_infra/inference_platform/test_platform/script/run_sigrid_4card.sh --port 7451 --local_model_dir /home/lujia/script --cuda_devices 6 --bind_node 3 --model_id 962549778_514 --gflag_config_path sigrid/predictor/predictor_x_gflags_mrs_prospector_gpu_torchscript_fusedsolution_1card_opt_fm -- --enable_thrift_warmup=false --tgif_replicate_merge_by_tempfile=false --enable_inplace_snapshot_transition --model_version_config_path sigrid/predictor/models_version/lujia_test --inplace_update_max_retries 0 --submod_to_device="merge|cuda0"
```
**Load test on job tsp_eag/smart/inference_platform_sp__sigrid_predictor_gpu_adhoc_realtimetest_m962549778_latest.s3**
Before:
(p99 latency)
{F1066957232}
(SR error rate)
{F1066957650}
After:
(p99 latency)
{F1066957141}
(SR error rate)
{F1066957376}
Differential Revision: D48182533
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106849
Approved by: https://github.com/842974287, https://github.com/kit1980