Make sure we dealloc on recording, not just replay (#97440)
Copy the non-CUDA-graph inputs while allocating the recording inputs, so that they do not need to be allocated while the graph is being recorded.
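A rough illustration of the ordering described above, as a toy pure-Python model rather than the actual change. `ToyPool`, `prepare_recording_inputs`, and `record` are hypothetical names invented for this sketch; they are not PyTorch APIs. The pool only counts allocations, so the sketch shows that copying the unmanaged inputs up front leaves no input allocation to happen during recording.

```python
from dataclasses import dataclass


@dataclass
class ToyPool:
    """Hypothetical stand-in for a memory pool; it only counts allocations."""
    allocations: int = 0

    def alloc(self, size: int) -> list:
        self.allocations += 1
        return [0.0] * size


def prepare_recording_inputs(inputs, is_graph_managed, pool):
    """Build the inputs used for recording (illustrative assumption).

    Inputs already managed by the graph are reused as-is. Every other input
    is copied into a buffer allocated here, before recording begins.
    """
    prepared = []
    for value, managed in zip(inputs, is_graph_managed):
        if managed:
            prepared.append(value)
        else:
            buf = pool.alloc(len(value))
            buf[:] = value  # copy the data into the pre-allocated buffer
            prepared.append(buf)
    return prepared


def record(fn, prepared, pool):
    """Run fn once as the 'recording' and count allocations made during it."""
    before = pool.allocations
    out = fn(*prepared)
    return out, pool.allocations - before


pool = ToyPool()
inputs = [[1.0, 2.0], [3.0, 4.0]]
# The first input is treated as graph-managed, the second is not.
prepared = prepare_recording_inputs(inputs, [True, False], pool)
out, allocs_during_recording = record(
    lambda a, b: [x + y for x, y in zip(a, b)], prepared, pool
)
```

In this toy model the single input allocation happens inside `prepare_recording_inputs`, so `allocs_during_recording` counts none for the inputs.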
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97440
Approved by: https://github.com/ezyang