Modify D2H copy with a different dtype (#80607)
This PR fixes #79933.
`copy_kernel_cuda` is slightly modified so that a device-to-host (D2H) copy with a different dtype performs the type conversion on the GPU.
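
For illustration only, here is a user-level sketch of the pattern described above: convert on the source device first, then do a plain same-dtype copy to the host. It is not the kernel change itself. The helper name `d2h_cast_on_device` is hypothetical, and the script falls back to a CPU tensor when CUDA is unavailable, in which case it only checks that both paths agree.

```python
import torch


def d2h_cast_on_device(src: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    """Hypothetical helper: cast on the source device, then copy to the host."""
    converted = src.to(dtype)   # conversion runs on src.device (the GPU for a CUDA tensor)
    return converted.to("cpu")  # same-dtype copy to host memory


# Use a CUDA tensor when possible; otherwise fall back to CPU (assumption for portability).
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.arange(8, dtype=torch.float32, device=device)

# Single-call form: copy to the host and change dtype at the same time.
one_call = x.to(device="cpu", dtype=torch.float16)

# Explicit two-step form: cast on the device, then copy.
two_step = d2h_cast_on_device(x, torch.float16)

# Both paths should yield the same values, dtype, and device.
assert one_call.dtype == two_step.dtype == torch.float16
assert torch.equal(one_call, two_step)
```

When the target dtype is narrower than the source, casting before the transfer also reduces the number of bytes moved to the host.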
Profiling the following code snippet from the issue demonstrates this behavior:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80607
Approved by: https://github.com/ngimel