Always enable P2P access for GPU copies (#21872)
Summary:
PR https://github.com/pytorch/pytorch/issues/20685 enabled P2P access only for
non-contiguous copies, which was incorrect. Without P2P access, cudaMemcpy can
be slow for inter-GPU copies, especially on ROCm devices. I didn't notice a
difference on CUDA 10, but ngimel says it's important for CUDA too.
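
A rough sketch of the intended control flow, not the actual change. Every name
here (enabled_pairs, enable_p2p, copy_between_devices) is hypothetical, and
the split into a memcpy path for contiguous copies and a kernel path for
non-contiguous copies is an assumption made for illustration. The point is
only that P2P access is requested before branching on contiguity:

```python
# Hypothetical model of the copy path; none of these names are PyTorch APIs.
enabled_pairs = set()  # (dst_device, src_device) pairs with P2P access enabled


def enable_p2p(dst_device, src_device):
    """Stand-in for a runtime call such as cudaDeviceEnablePeerAccess."""
    enabled_pairs.add((dst_device, src_device))


def copy_between_devices(dst_device, src_device, contiguous):
    """Return the copy path taken; P2P is requested before branching."""
    if dst_device != src_device:
        # Requested for every cross-device copy, not only non-contiguous ones.
        enable_p2p(dst_device, src_device)
    # Assumed split: contiguous copies use memcpy, the rest use a kernel.
    return "memcpy" if contiguous else "kernel"
```

In this model a contiguous copy from device 1 to device 0 records the pair
(0, 1) before taking the memcpy path, which is the case the earlier PR missed.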
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21872
Differential Revision: D15863965
Pulled By: colesbury
fbshipit-source-id: 0a858f3c338fa2a5d05949d7f65fc05a70a9dfe1