pytorch
bc66ddb5 - Add torch.distributed.DistBackendError exception type, thrown from C10D_NCCL_CHECK (#88134)


1 year ago

Add torch.distributed.DistBackendError exception type, thrown from C10D_NCCL_CHECK (#88134)

Currently all of the distributed errors are thrown from the `TORCH_CHECK` macro which throws a generic `RuntimeError`. This change introduced a new error type `DistBackendError` which derives from `RuntimeError` to signify there was an error with the backend communication library. This allows for better error handling and analysis at higher levels in the stack.

Motivation: https://docs.google.com/document/d/1j6VPOkC6znscliFuiDWMuMV1_fH4Abgdq7TCHMcXai4/edit#heading=h.a9rc38misyx8

Changes:
- introduce new error type
- Update `C10D_NCCL_CHECK`

Sample script to demonstrate new error type

```python
# python -m torch.distributed.run --nproc_per_node=2 <script>.py
import torch
import torch.distributed as dist

if __name__ == "__main__":
    dist.init_process_group("nccl")
    dist.broadcast(torch.tensor([1, 2, 3]).cuda(), 0)
```

Differential Revision: [D40998803](https://our.internmc.facebook.com/intern/diff/D40998803)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88134

Approved by: https://github.com/rohan-varma
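Because `DistBackendError` derives from `RuntimeError`, existing `except RuntimeError` handlers keep working, while new code can catch the backend error first and treat it separately. The following is a minimal sketch of that handler ordering, not code from this commit. The `classify` helper is hypothetical, and the fallback class is a stand-in used only when `torch.distributed.DistBackendError` cannot be imported (for example, on a build that predates this change).

```python
# Prefer the real exception type; fall back to a local stand-in (an assumption
# for illustration only) when torch is missing or too old to provide it.
try:
    from torch.distributed import DistBackendError
except Exception:
    class DistBackendError(RuntimeError):
        """Stand-in mirroring the documented RuntimeError subclassing."""


def classify(fn):
    """Hypothetical helper: run fn and report which kind of error it raised."""
    try:
        fn()
    except DistBackendError:
        # Must come before the RuntimeError clause, since it is a subclass.
        return "backend"
    except RuntimeError:
        return "runtime"
    return "ok"
```

In a real job, the callable passed to `classify` would wrap a collective such as `dist.broadcast(...)`; a caller might then retry or tear down the process group only for the `"backend"` case.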

Author

H-Huang

Committer

pytorchmergebot

Parents

1a7c4b0d
