[c10d][Fix] Start gloo sequence numbers at 0. (#101422)
The Gloo process group used to create a random sequence number on rank 0 and
broadcast it to the rest of the group via the store. But once we started
enforcing sequence number checks in ProcessGroupWrapper, this proved
occasionally flaky. For example, the following error from a job was spurious:
all ranks were running the first broadcast collective, but the sequence number
wasn't communicated across the store correctly:
```
RuntimeError: Detected mismatch between collectives on ranks. Rank 16 is running collective: CollectiveFingerPrint(SequenceNumber=1977865401, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 1 is running collective: CollectiveFingerPrint(SequenceNumber=54090078, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))).Collectives differ in the following aspects: Sequence number: 1977865401vs 54090078
```
The issue reproduces rarely in tests, but is more common in large world size
jobs.
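To illustrate the idea (this is a simplified Python sketch, not the actual c10d C++ implementation; the function and store names are hypothetical): the old scheme required rank 0 to publish a random number through the store and every other rank to read it back, so any store inconsistency left ranks with diverging sequence numbers. Starting every rank at 0 removes the communication entirely.

```python
import random

def old_init_seq(rank, store):
    # Old behavior (sketch): rank 0 picks a random sequence number and
    # publishes it via the store; other ranks read it back. Any failure or
    # race in this exchange leaves ranks with different sequence numbers.
    if rank == 0:
        store["seq"] = random.randint(0, 2**31 - 1)
    return store["seq"]

def new_init_seq(rank, store):
    # Fixed behavior (sketch): every rank deterministically starts at 0,
    # so no store round-trip is needed and ranks can never disagree on
    # the initial value.
    return 0

store = {}
seqs = [new_init_seq(r, store) for r in range(4)]
assert len(set(seqs)) == 1  # all ranks agree without any store traffic
```

Since every rank increments its local counter once per collective, agreement on the starting value is all that is needed to keep the ProcessGroupWrapper fingerprints in sync.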
Differential Revision: [D45870688](https://our.internmc.facebook.com/intern/diff/D45870688/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101422
Approved by: https://github.com/H-Huang