NCCL process group: avoid workEnqueue when capturing cuda graph (#102542)
Summary:
Summary:
In torch.distributed, make ProcessGroupNCCL skip workEnqueue when the CUDA stream is capturing. That is, while capturing a CUDA graph, we enqueue nothing for the watchdog thread to consider. This makes it possible to capture NCCL operations in a CUDA graph.
This is a follow-up to an internal discussion [1] in which the watchdog thread was observed to crash when CUDA graphs containing an all_reduce were used. The watchdog thread wants to query events pertaining to enqueued work items, but this can't be done for "events" created during CUDA graph capture.
[1] https://fb.workplace.com/groups/1405155842844877/posts/6975201909173548/
Test Plan: Test added. The repro mentioned in https://fb.workplace.com/groups/1405155842844877/posts/7003002339726838/ also runs successfully after this change.
Differential Revision: D46274814
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102542
Approved by: https://github.com/kwen2501