Clean up Copy.cu logic (#97071)
Some of the peer-access logic specific to the cudaMallocAsync allocator lives outside the allocator itself. This PR refactors, documents, and encapsulates that logic while preserving the existing behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97071
Approved by: https://github.com/ngimel, https://github.com/eellison