Pin destination memory for cuda_tensor.to("cpu", non_blocking=True) (#46878)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39694.
The test calls [`torch.cuda._sleep(int(100 * get_cycles_per_ms()))`](https://github.com/pytorch/pytorch/pull/46878/files#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R511-R513) to queue roughly 100 ms of spinning on the stream before the copy, which helps avoid the flakiness noted by ngimel (https://github.com/pytorch/pytorch/pull/35144#issuecomment-602103631).
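For context, a minimal sketch of the call pattern this change targets. `copy_to_cpu_async` is a hypothetical helper used only for illustration and is not part of the PR; the `is_pinned()` check reflects what the title describes and is not a guaranteed result on every build.

```python
import torch


def copy_to_cpu_async(src: torch.Tensor) -> torch.Tensor:
    """Copy `src` to the CPU with non_blocking=True, then wait for the copy."""
    dst = src.to("cpu", non_blocking=True)
    if src.is_cuda:
        # The device-to-host copy is queued on the current CUDA stream, so
        # synchronize that stream before reading `dst` on the host.
        torch.cuda.current_stream().synchronize()
    return dst


if torch.cuda.is_available():
    x = torch.arange(8, device="cuda", dtype=torch.float32)
    y = copy_to_cpu_async(x)
    # With this change the destination is expected to live in pinned memory.
    print("pinned:", y.is_pinned())
else:
    # Without a GPU the call is a plain no-op copy; shown only so the sketch runs.
    x = torch.arange(8, dtype=torch.float32)
    y = copy_to_cpu_async(x)

print(torch.equal(y, x.cpu()))
```

Reading `dst` before the stream has been synchronized may observe incomplete data, which is why the helper synchronizes before returning.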
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46878
Reviewed By: izdeby
Differential Revision: D24550403
Pulled By: xw285cornell
fbshipit-source-id: 1ecc35ef75f9a38ab332aacdf4835955105edafc