Enables barrier to support the specified device (#99589)
Enables barrier to support the specified device, e.g. CUDA or a custom device. There is some discussion here: https://github.com/pytorch/pytorch/issues/97938#issue-1646833919
Today, barrier has two limitations.
The first is that barrier does not support custom devices:
https://github.com/pytorch/pytorch/blob/fbdb86c1747737c744ad79b5da6bcbd064dc982e/torch/csrc/distributed/c10d/ProcessGroup.hpp#L512-L522
The second is that there is a special validity check for nccl when device_ids is not None. The check assumes CUDA and the NCCL bindings, so it also hinders custom devices:
https://github.com/pytorch/pytorch/blob/789070986c32835bd03f2d14bd77dd31e59ef95d/torch/distributed/distributed_c10d.py#L3504-L3508
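To make the second limitation concrete, here is a minimal sketch of the kind of check described above. It is not the PyTorch source: the function name validate_barrier_device_ids, the plain-string backend names, and the error text are assumptions for illustration only. It uses no torch imports so that it runs on its own.

```python
from typing import List, Optional


def validate_barrier_device_ids(backend: str, device_ids: Optional[List[int]]) -> None:
    """Hypothetical stand-in for a backend-specific barrier argument check."""
    # No device requested: nothing to validate for any backend.
    if device_ids is None:
        return
    # Assumed shape of the limitation: only the nccl backend may pass
    # device_ids, so a custom-device backend is rejected here.
    if backend != "nccl":
        raise RuntimeError(
            f"device_ids is not supported for the selected backend {backend}"
        )
```

With a check of this shape, validate_barrier_device_ids("nccl", [0]) passes, while the same call with a custom backend name raises RuntimeError even when that backend could honor the requested device.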
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99589
Approved by: https://github.com/kwen2501