Modify D2H copy with a different dtype (#80607)
This PR fixes #79933.
`copy_kernel_cuda` is slightly modified so that a copy involving a dtype change performs the type conversion on the GPU.
Profiling the code snippet from the issue shows the following behavior:
![d2h](https://user-images.githubusercontent.com/31858918/178574835-e932ec45-6b07-4682-a6f0-71c0e48c2fb1.png)
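For illustration only (this is not the snippet from the issue), here is a minimal sketch of the kind of call this change concerns: a device-to-host copy that also changes the dtype. The tensor name, size, and dtypes are hypothetical, and the CPU fallback is an assumption added so the sketch runs on machines without CUDA.

```python
import torch

# Assumption: fall back to the CPU when no CUDA device is present so the
# sketch still runs; no D2H transfer happens in that case.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical source tensor living on the device.
src = torch.arange(8, dtype=torch.float32, device=device)

# A single call that both moves the data to the host and changes its dtype.
fused = src.to("cpu", dtype=torch.float64)

# The same result with the conversion spelled out on the source device
# first, followed by a same-dtype copy to the host.
explicit = src.to(torch.float64).cpu()

print(fused.dtype, fused.device, torch.equal(fused, explicit))
```

Both forms should yield the same values on the host; only where the conversion runs differs.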
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80607
Approved by: https://github.com/ngimel