[NCCL] Use OptionalCUDAGuard in ProcessGroupNCCL::WorkNCCL::synchronizeInternal (#98895)
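Constructing a `CUDAGuard` inside the loop performs a redundant `set_device` on every iteration: the constructor switches to `device`, and the destructor switches back to the original device just before the next iteration switches away again: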
```C++
{
  for (auto& device : devices_) {
    at::cuda::CUDAGuard gpuGuard(device); // set device
    // ...
    // ~gpuGuard() sets original device
  }
  // ...
}
```
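It is more efficient to hoist a single `OptionalCUDAGuard` out of the loop, so that each iteration only switches to the next device and the original device is restored once, when the guard goes out of scope: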
```C++
{
  at::cuda::OptionalCUDAGuard gpuGuard;
  for (auto& device : devices_) {
    gpuGuard.set_index(device.index()); // set device
    // ...
  }
  // ...
  // ~gpuGuard() sets original device
}
```
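To make the difference concrete, the two patterns above can be modeled without CUDA. The following is a minimal sketch, not PyTorch code: `FakeRuntime`, `ScopedDeviceGuard`, `OptionalDeviceGuard`, and both counting functions are hypothetical stand-ins introduced for illustration. The model counts every device-switch request, whereas the real guards may skip a switch when the requested device is already current.

```C++
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical stand-in for the runtime's "current device" state.
struct FakeRuntime {
  int current_device = 0;
  int set_device_calls = 0;
  void set_device(int index) {
    current_device = index;
    ++set_device_calls;
  }
};

// Models a CUDAGuard-style guard: switch on construction, restore on destruction.
class ScopedDeviceGuard {
 public:
  ScopedDeviceGuard(FakeRuntime& rt, int index)
      : rt_(rt), original_(rt.current_device) {
    rt_.set_device(index);
  }
  ~ScopedDeviceGuard() { rt_.set_device(original_); }
  ScopedDeviceGuard(const ScopedDeviceGuard&) = delete;
  ScopedDeviceGuard& operator=(const ScopedDeviceGuard&) = delete;

 private:
  FakeRuntime& rt_;
  int original_;
};

// Models an OptionalCUDAGuard-style guard: starts empty, records the original
// device on the first set_index, and restores it once on destruction.
class OptionalDeviceGuard {
 public:
  explicit OptionalDeviceGuard(FakeRuntime& rt) : rt_(rt) {}
  ~OptionalDeviceGuard() {
    if (original_) rt_.set_device(*original_);
  }
  OptionalDeviceGuard(const OptionalDeviceGuard&) = delete;
  OptionalDeviceGuard& operator=(const OptionalDeviceGuard&) = delete;

  void set_index(int index) {
    assert(index >= 0);
    if (!original_) original_ = rt_.current_device;
    rt_.set_device(index);
  }

 private:
  FakeRuntime& rt_;
  std::optional<int> original_;
};

// Pattern 1: a fresh guard per iteration (set + restore each time).
int calls_with_guard_per_iteration(const std::vector<int>& devices) {
  FakeRuntime rt;
  for (int device : devices) {
    ScopedDeviceGuard guard(rt, device);
    // ... per-device work would go here ...
  }
  return rt.set_device_calls;
}

// Pattern 2: one guard hoisted out of the loop (set each time, restore once).
int calls_with_hoisted_optional_guard(const std::vector<int>& devices) {
  FakeRuntime rt;
  {
    OptionalDeviceGuard guard(rt);
    for (int device : devices) {
      guard.set_index(device);
      // ... per-device work would go here ...
    }
  }
  return rt.set_device_calls;
}
```

Under this model, N devices cost 2N switch requests with a per-iteration guard and N + 1 with the hoisted guard.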
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98895
Approved by: https://github.com/mrshenli