`_foreach_copy` with different src/dst dtypes (#121717)
Fixes #115171

Benchmark of foreach copy for each self/src dtype combination:
```
torch.version.git_version = '6bff6372a922fe72be5335c6844c10e2687b967d', torch.cuda.get_device_name() = 'NVIDIA RTX 6000 Ada Generation'
[------------------ foreach copy - self: torch.float32 - shape: (512, 512) -----------------]
                    | src: torch.float32 | src: torch.float16 | src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
  num_tensors: 32   |               14.2 |               12.6 |                12.7
  num_tensors: 256  |              688.0 |              510.3 |               514.0
  num_tensors: 1024 |             2768.0 |             2053.3 |              2047.7
Times are in microseconds (us).

[------------------ foreach copy - self: torch.float16 - shape: (512, 512) -----------------]
                    | src: torch.float32 | src: torch.float16 | src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
  num_tensors: 32   |               10.0 |                8.9 |                 8.8
  num_tensors: 256  |              497.6 |              344.3 |               348.3
  num_tensors: 1024 |             1991.9 |             1392.0 |              1389.0
Times are in microseconds (us).

[----------------- foreach copy - self: torch.bfloat16 - shape: (512, 512) -----------------]
                    | src: torch.float32 | src: torch.float16 | src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
  num_tensors: 32   |               10.0 |                8.8 |                 8.8
  num_tensors: 256  |              497.5 |              344.5 |               348.0
  num_tensors: 1024 |             1993.2 |             1390.4 |              1387.5
Times are in microseconds (us).

[------------------ foreach copy - self: torch.float32 - shape: (515, 515) -----------------]
                    | src: torch.float32 | src: torch.float16 | src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
  num_tensors: 32   |               19.0 |               17.9 |                18.1
  num_tensors: 256  |              707.2 |              540.2 |               543.1
  num_tensors: 1024 |             2900.6 |             2156.6 |              2159.2
Times are in microseconds (us).

[------------------ foreach copy - self: torch.float16 - shape: (515, 515) -----------------]
                    | src: torch.float32 | src: torch.float16 | src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
  num_tensors: 32   |               13.8 |               13.7 |                13.1
  num_tensors: 256  |              513.2 |              352.6 |               350.4
  num_tensors: 1024 |             2047.6 |             1404.4 |              1400.4
Times are in microseconds (us).

[----------------- foreach copy - self: torch.bfloat16 - shape: (515, 515) -----------------]
                    | src: torch.float32 | src: torch.float16 | src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
  num_tensors: 32   |               13.6 |               12.8 |                14.2
  num_tensors: 256  |              511.9 |              351.8 |               350.6
  num_tensors: 1024 |             2045.4 |             1402.2 |              1401.4
Times are in microseconds (us).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121717
Approved by: https://github.com/janeyx99