[NCCL][CUDA][CUDA Graphs] Flush enqueued work before starting a graph capture (#104487)
#103503 addresses the situation where additional work is enqueued for the NCCL watchdog to poll during a graph capture. We want to avoid this because the subsequent polling will query an event and crash the capture.
However, there is currently no check for work that was _already_ enqueued for the watchdog to poll before the capture started. If such work was enqueued and not cleaned up before the start of a graph capture, then we run into a similar problem where the event query will crash the capture. We've observed this causing crashes on several workloads, although it is somewhat flaky (if the watchdog happens to poll and clean up just before the graph capture starts, then we dodge the crash).
This is a bit of a tricky issue as it involves making sure that no process group has enqueued work at the start of a capture, and as such the simplest solution is to add a bit of global state to track all process groups. This PR forces the start of the graph capture to wait until all enqueued work is completed and cleaned up or times out.
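As a rough illustration of the idea (not the actual `ProcessGroupNCCL` code), here is a minimal Python sketch of a global registry of process groups plus a wait that blocks until every group's pending-work list is empty. All names (`FakeProcessGroup`, `wait_for_pending_work`, `_all_process_groups`) are hypothetical, work items are modeled as callables that return `True` once complete, and the timeout is assumed to be a deadline on the wait itself.

```python
import threading
import time

# Hypothetical stand-ins for illustration; not the PyTorch implementation.
_registry_lock = threading.Lock()
_all_process_groups = []  # global state tracking every process group


class FakeProcessGroup:
    def __init__(self):
        self._lock = threading.Lock()
        self._pending = []  # work the watchdog still has to poll
        with _registry_lock:
            _all_process_groups.append(self)

    def enqueue(self, work):
        """Register a callable that returns True once the work is complete."""
        with self._lock:
            self._pending.append(work)

    def watchdog_cleanup(self):
        """Drop completed work, as one watchdog polling pass would."""
        with self._lock:
            self._pending = [w for w in self._pending if not w()]

    def has_pending(self):
        with self._lock:
            return bool(self._pending)


def wait_for_pending_work(timeout_s, poll_interval_s=0.001):
    """Block until no registered group has pending work.

    Returns True if everything was cleaned up, False if the deadline passed.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        with _registry_lock:
            groups = list(_all_process_groups)
        if not any(g.has_pending() for g in groups):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

In this sketch, a capture entry point would call `wait_for_pending_work(...)` before beginning the capture, so that no watchdog poll of previously enqueued work can land inside the captured region.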
I did consider the alternative of simply having the watchdog skip cleanup if we detect that we are in the middle of a graph capture, but I think deferring the cleanup until later could result in false positive timeouts (e.g., if we defer cleanup on work that has completed long ago, checking timers after the graph capture could yield a "timeout").
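To make the false-positive concern concrete, here is a toy Python model. It assumes a naive timer check that only compares the time elapsed since enqueue against the timeout at the moment cleanup runs; `WorkRecord` and `looks_timed_out` are hypothetical names, and the numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorkRecord:
    start_s: float     # when the work was enqueued
    finished_s: float  # when the work actually completed (unseen by the check)


def looks_timed_out(work, checked_at_s, timeout_s):
    """Naive check: elapsed time since enqueue, measured at cleanup time."""
    return checked_at_s - work.start_s > timeout_s


# Work that finished after 1 second, well within a 10 second timeout.
work = WorkRecord(start_s=0.0, finished_s=1.0)
timeout_s = 10.0

# Cleanup shortly after completion: no timeout is reported.
prompt_check = looks_timed_out(work, checked_at_s=1.5, timeout_s=timeout_s)

# Cleanup deferred until after a long graph capture: the same completed
# work now appears to have exceeded its timeout.
deferred_check = looks_timed_out(work, checked_at_s=60.0, timeout_s=timeout_s)
```

Under these assumptions, `prompt_check` is `False` while `deferred_check` is `True`, even though the work itself completed at the same time in both cases.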
CC @Aidyn-A
@bottler @kwen2501 @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104487
Approved by: https://github.com/kwen2501