6f8a71aa - [c10d][Fix] Start gloo sequence numbers at 0. (#101422)

The Gloo PG used to create a random sequence number and broadcast it to the rest of the group via the store. But once sequence number checks were enforced in ProcessGroupWrapper, this was observed to be occasionally flaky. For example, the following error in a job was spurious: all ranks were in fact running the first broadcast collective, but the sequence number was not communicated across the store correctly:

```
RuntimeError: Detected mismatch between collectives on ranks. Rank 16 is running collective: CollectiveFingerPrint(SequenceNumber=1977865401, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 1 is running collective: CollectiveFingerPrint(SequenceNumber=54090078, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))).
Collectives differ in the following aspects: Sequence number: 1977865401 vs 54090078
```

The issue reproduces rarely in tests but is more common in jobs with a large world size.

Differential Revision: [D45870688](https://our.internmc.facebook.com/intern/diff/D45870688/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101422
Approved by: https://github.com/H-Huang
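To make the two schemes concrete, here is a minimal, hypothetical sketch (not the actual c10d implementation) contrasting "rank 0 picks a random number and publishes it via the store" with "every rank starts at 0". The names `FakeStore`, `old_init_sequence_number`, and `new_init_sequence_number` are invented for illustration; the real code lives in C++ in ProcessGroupGloo and uses a c10d store.

```python
# Hypothetical, simplified sketch of the two sequence-number initialization
# schemes. FakeStore stands in for a c10d key/value store.
import random


class FakeStore:
    """A plain dict modeling the distributed key/value store."""

    def __init__(self):
        self._kv = {}

    def set(self, key, value):
        self._kv[key] = value

    def get(self, key):
        # The real store blocks until the key is written; returning None here
        # models the rare case where the value is not seen correctly.
        return self._kv.get(key)


def old_init_sequence_number(rank: int, store: FakeStore) -> int:
    # Old scheme: rank 0 picks a random starting sequence number and publishes
    # it; every other rank reads it back from the store. Any hiccup in that
    # exchange leaves ranks with different numbers, which ProcessGroupWrapper
    # later reports as a collective mismatch like the error above.
    if rank == 0:
        seq = random.getrandbits(31)
        store.set("seq_num", seq)
        return seq
    seq = store.get("seq_num")
    return seq if seq is not None else random.getrandbits(31)


def new_init_sequence_number(rank: int, store: FakeStore) -> int:
    # Fixed scheme: every rank deterministically starts at 0, so there is no
    # store round-trip that can fail and the ranks can never disagree.
    return 0


if __name__ == "__main__":
    store = FakeStore()
    # With the toy store all ranks happen to agree, but the real exchange
    # occasionally did not, producing mismatched sequence numbers across ranks.
    print([old_init_sequence_number(r, store) for r in range(4)])
    print([new_init_sequence_number(r, store) for r in range(4)])  # [0, 0, 0, 0]
```

Since the sequence number is only compared across ranks to detect desynchronized collectives, a fixed starting value appears to lose nothing while removing the store exchange that could be flaky.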