pytorch
f7278473 - [NCCL] Fix NCCL_BLOCKING_WAIT functionality with Async Error Handling (#44411)

Commit
4 years ago
[NCCL] Fix NCCL_BLOCKING_WAIT functionality with Async Error Handling (#44411)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44411

This aborts errored NCCL communicators only if either blocking wait or async error handling is enabled. Otherwise we may abort NCCL communicators when neither is enabled, which may result in subsequent GPU operations using corrupted data.

ghstack-source-id: 111839264

Test Plan: Successful Flow run: f217591683

Reviewed By: jiayisuse

Differential Revision: D23605382

fbshipit-source-id: 6c16f9626362be3b0ce2feaf0979b2dff97ce61b
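The condition described in the summary can be illustrated with a minimal Python sketch. This is not the code from the commit (the actual change lives in PyTorch's C++ NCCL process group); `FakeComm`, `abort_errored_comms`, and the two boolean parameters are hypothetical names used only to show the guard: errored communicators are aborted only when blocking wait or async error handling is enabled.

```python
from dataclasses import dataclass


@dataclass
class FakeComm:
    """Hypothetical stand-in for an NCCL communicator (not a PyTorch type)."""
    has_error: bool = False
    aborted: bool = False

    def abort(self) -> None:
        self.aborted = True


def abort_errored_comms(comms, blocking_wait, async_error_handling):
    """Abort errored communicators only if one of the two modes is enabled.

    Returns the list of communicators that were aborted by this call.
    """
    # Guard from the commit summary: with neither mode enabled, leave
    # communicators untouched so later GPU work is not affected by an abort.
    if not (blocking_wait or async_error_handling):
        return []

    aborted = []
    for comm in comms:
        if comm.has_error and not comm.aborted:
            comm.abort()
            aborted.append(comm)
    return aborted
```

With both flags off, an errored communicator is left as-is; with either flag on, only the errored communicators are aborted and healthy ones are not touched.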