f62d8b2a - ProcessGroupWrapper log full rank fingerprint mismatches (#79901)

### Current Error Message:

```
Detected mismatch between collectives on ranks. Rank 0 is running collective: CollectiveFingerPrint(OpType=ALLREDUCE, …), but Rank 1 is running collective: REDUCE.
```

### Ops Mismatch, New Error Message (shows the full fingerprint, including tensor shapes, data types, and device types):

```
Detected mismatch between collectives on ranks. Rank 0 is running collective: CollectiveFingerPrint(OpType=ALLREDUCE, TensorShape=[20, 10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 1 is running collective: CollectiveFingerPrint(OpType=REDUCE, TensorShape=[20, 10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))).
```

### Shape Mismatch, New Error Message:

```
RuntimeError: Detected mismatch between collectives on ranks. Rank 0 is running collective: CollectiveFingerPrint(OpType=SCATTER, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 1 is running collective: CollectiveFingerPrint(OpType=SCATTER, TensorShape=[2], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))).
```

Changes:
- Update the deserialize function to read tensor shapes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79901
Approved by: https://github.com/rohan-varma
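For context, here is a minimal sketch (not part of the commit) of how the ops mismatch above can be reproduced. It assumes a two-rank `gloo` job on one machine; setting `TORCH_DISTRIBUTED_DEBUG=DETAIL` wraps each process group in `ProcessGroupWrapper`, which fingerprints each collective and compares the fingerprints across ranks before running it. The address, port, and tensor shape are arbitrary choices for the example.

```python
# Minimal repro sketch: two ranks issue different collectives, so the
# ProcessGroupWrapper fingerprint check should fail with the error above.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # DETAIL mode enables ProcessGroupWrapper and its collective checks.
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    t = torch.zeros(20, 10)  # shape/dtype matching the example message
    if rank == 0:
        dist.all_reduce(t)    # rank 0 runs ALLREDUCE...
    else:
        dist.reduce(t, dst=0)  # ...while rank 1 runs REDUCE: mismatch.

    dist.destroy_process_group()


if __name__ == "__main__":
    # Expect a RuntimeError quoting the full CollectiveFingerPrint
    # (op type, tensor shape, dtype, and device) on the mismatching ranks.
    mp.spawn(worker, args=(2,), nprocs=2)
```

The shape-mismatch message can be provoked the same way, e.g. by having the ranks pass differently sized tensors to the same `scatter` call.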