Use c10 threadpool for GPU to CPU distributed autograd continuations. (#42511)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42511
DistEngine currently uses only a single thread to execute GPU to CPU
continuations as part of the backward pass. This becomes a significant
performance bottleneck in cases where such continuations exist and we would
like to execute them using all CPU cores.
To alleviate this, in this PR the single thread in DistEngine only dequeues
work from the global queue and then hands off execution of that work to the
c10 threadpool, where we call "execute_graph_task_until_ready_queue_empty".
For more context please see:
https://github.com/pytorch/pytorch/issues/40255#issuecomment-663298062.
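The dispatch pattern described above (one dedicated thread that only dequeues, with execution fanned out to a shared pool) can be sketched in Python. This is a hypothetical illustration, not the actual C++ implementation; `execute_graph_task` here is a stand-in for `execute_graph_task_until_ready_queue_empty`, and the queue, pool, and sentinel shutdown are assumptions for the sketch.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

task_queue = queue.Queue()        # stand-in for DistEngine's global ready queue
pool = ThreadPoolExecutor()       # stand-in for the c10 threadpool
results = []
results_lock = threading.Lock()

def execute_graph_task(task_id):
    # Stand-in for execute_graph_task_until_ready_queue_empty: the actual
    # continuation work runs here, on a pool worker, not on the dequeue thread.
    with results_lock:
        results.append(task_id)

def dequeue_loop():
    # The lone dispatcher thread: it only dequeues and hands off, so the
    # pool's workers (one per CPU core) do the heavy lifting in parallel.
    while True:
        task = task_queue.get()
        if task is None:          # sentinel signals shutdown
            break
        pool.submit(execute_graph_task, task)

dispatcher = threading.Thread(target=dequeue_loop)
dispatcher.start()
for i in range(8):
    task_queue.put(i)
task_queue.put(None)
dispatcher.join()
pool.shutdown(wait=True)
```

With this split, the dispatcher never blocks on task execution, so slow continuations cannot serialize the backward pass behind a single thread.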
ghstack-source-id: 109997718
Test Plan: waitforbuildbot
Reviewed By: albanD
Differential Revision: D22917579
fbshipit-source-id: c634b6c97f3051f071fd7b994333e6ecb8c54155