pytorch
f3151052 - [autograd] fix engine flakiness (#35599)


Commit

4 years ago

[autograd] fix engine flakiness (#35599)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35599

Before https://github.com/pytorch/pytorch/pull/33157 we did not check whether the ready queue was empty, because the CPU worker's queue might not be empty. After #33157, we check that the owner thread's ready_queue is empty after inline execution. This does not always hold. Consider the following case:

There are two threads, the CPU thread that calls backward() and the GPU device thread, and the graph is:

GraphRoot(CPU) -> ComputeNode(GPU)

In thread_main, both threads decrement `--local_graph_task->outstanding_tasks_`, together bringing it to zero, and then both threads enter `if (graph_task_completed(local_graph_task))`. The CPU thread breaks out, finishes, and checks whether local_ready_queue is empty. The GPU thread sends a dummy task to the CPU thread's ready queue, because it thinks the graph_task finished on its own thread (it actually finished on both threads together). So there are cases where a dummy task remains in the queue.

This happens very rarely and non-deterministically, but it can be triggered when we run many jobs in CI.

Remove the check to fix the flakiness.

Test Plan: Imported from OSS

Differential Revision: D20739778

Pulled By: wanchaol

fbshipit-source-id: 75a671762650a188f44720625d53f0873617c684
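The interleaving described above can be replayed deterministically in a small model. This is only a sketch, not the engine code: the real logic is C++ with real threads, and the `GraphTask` class, the `simulate_interleaving` function, the string queue keys, and the starting counter value of 2 (GraphRoot has already enqueued ComputeNode) are all assumptions made for illustration. Only `outstanding_tasks`, `graph_task_completed`, and the dummy task are taken from the commit message.

```python
from collections import deque


class GraphTask:
    """Hypothetical stand-in for the engine's graph task (illustration only)."""

    def __init__(self, outstanding_tasks, owner):
        self.outstanding_tasks = outstanding_tasks  # models outstanding_tasks_
        self.owner = owner                          # thread that called backward()


def graph_task_completed(graph_task):
    # Models the completion check: no tasks are outstanding.
    return graph_task.outstanding_tasks == 0


def simulate_interleaving():
    """Replay the problematic schedule one step at a time, without real threads."""
    ready_queues = {"cpu": deque(), "gpu": deque()}
    # Assumed state: GraphRoot(CPU) has run and enqueued ComputeNode(GPU),
    # so two tasks are still counted as outstanding.
    task = GraphTask(outstanding_tasks=2, owner="cpu")

    # CPU thread decrements after GraphRoot, then is preempted before its check.
    task.outstanding_tasks -= 1
    # GPU thread decrements after ComputeNode; the counter reaches zero.
    task.outstanding_tasks -= 1

    # Both threads now evaluate the completion check and both see "done".
    cpu_sees_done = graph_task_completed(task)
    gpu_sees_done = graph_task_completed(task)

    # The CPU thread breaks out of its loop here. The GPU thread is not the
    # owner, so it pushes a dummy task to wake the owner's queue.
    if gpu_sees_done and task.owner != "gpu":
        ready_queues[task.owner].append("dummy")

    # The removed check required this length to be 0 at this point.
    return cpu_sees_done, gpu_sees_done, len(ready_queues["cpu"])


if __name__ == "__main__":
    print(simulate_interleaving())
```

In this schedule both threads observe completion and one dummy task is left in the owner's queue, which is the state the removed emptiness check would have rejected.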

Author

wanchaol

Committer

facebook-github-bot

Parents

bb32e123
