pytorch
028d2d6e - [NCCL] Enhance watchdog to log exceptions (#54557)

Commit
3 years ago
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54557

When looping through the NCCL communicator cache to check for errors, the watchdog now logs exceptions that are set on the communicator. This improves debuggability, since the NCCL error is logged when the watchdog receives errors for the communicators and aborts them.

Tested by forcing an NCCL error with NCCL_BLOCKING_WAIT=1 and verifying that the exception is indeed logged.

ghstack-source-id: 125124310

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27106699

fbshipit-source-id: 1d2bd9f057a3796ce15dd8a4ce34cf6899eee45c
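
The watchdog behavior described in the summary can be illustrated with a minimal Python sketch. This is not the PyTorch implementation (the actual change lives in the C++ NCCL process group code); `FakeComm`, `check_for_error`, `abort`, and `watchdog_pass` are hypothetical names invented for illustration. The sketch only shows the ordering the commit describes: check each cached communicator, log any exception found, then abort.

```python
import logging
from typing import Dict, List, Optional

logger = logging.getLogger("nccl_watchdog_sketch")


class FakeComm:
    """Hypothetical stand-in for a cached NCCL communicator (not a PyTorch class)."""

    def __init__(self, error: Optional[Exception] = None) -> None:
        self._error = error
        self.aborted = False

    def check_for_error(self) -> Optional[Exception]:
        # Return the exception recorded on this communicator, if any.
        return self._error

    def abort(self) -> None:
        self.aborted = True


def watchdog_pass(comm_cache: Dict[str, List[FakeComm]]) -> List[str]:
    """Run one pass over the cache; log and abort entries that report an error."""
    aborted_keys = []
    for key, comms in comm_cache.items():
        for comm in comms:
            error = comm.check_for_error()
            if error is None:
                continue
            # Log the exception before aborting so the cause is visible in the logs.
            logger.error("Communicator error for key %s: %r", key, error)
            # Abort every communicator stored under the failing cache entry.
            for c in comms:
                c.abort()
            aborted_keys.append(key)
            break
    return aborted_keys


if __name__ == "__main__":
    logging.basicConfig(level=logging.ERROR)
    cache = {
        "0": [FakeComm()],
        "1": [FakeComm(RuntimeError("simulated NCCL error"))],
    }
    print(watchdog_pass(cache))
```

Logging before the abort is the point of the sketch: once a communicator is aborted, the log line is the only record of which error triggered it.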