Use c10::ThreadPool to send and receive messages (#23968)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23968
The existing ProcessGroupAgent uses a single thread to send all messages,
and a single thread to listen for and process all received messages. This
both hurts performance and prevents nested RPCs. For example, when
running the nested RPC A->B->A->B, the second recv on B cannot start until
the first recv on B finishes. If the second recv is triggered by a nested
RPC in the first recv, B deadlocks: the first recv blocks on a result that
only the second recv can produce. Ideally, we should expose something like
a responder or FutureResult to Python to support nested asynchronous UDFs.
This diff adds a shared ThreadPool for send and recv. Send uses it to send
out messages, and recv uses it to process received messages. There is still
a dedicated thread that listens for incoming messages and adds them to the
task queue. There are two goals: 1) speed up ProcessGroupAgent; 2) use
ThreadPool as a temporary solution for (a small number of) nested RPCs.
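To illustrate why a shared pool unblocks a small number of nested RPCs, here
is a minimal sketch in standard C++. It does not use c10::ThreadPool or
ProcessGroupAgent: SimplePool and handleNestedRequest are hypothetical names
invented for this example, and the dedicated listener thread is omitted. The
outer task plays the role of the first recv on B, and the task it enqueues
plays the role of the second recv.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical minimal pool; a stand-in for illustration, not c10::ThreadPool.
class SimplePool {
 public:
  explicit SimplePool(std::size_t numThreads) {
    for (std::size_t i = 0; i < numThreads; ++i) {
      workers_.emplace_back([this] { workerLoop(); });
    }
  }

  ~SimplePool() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopping_ = true;
    }
    cv_.notify_all();
    for (auto& t : workers_) {
      t.join();
    }
  }

  // Enqueue a task; any idle worker may pick it up.
  void run(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void workerLoop() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
        if (tasks_.empty()) {
          return;  // Stopping and the queue is drained.
        }
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();  // Run outside the lock so other workers can dequeue.
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  std::vector<std::thread> workers_;
  bool stopping_ = false;
};

// Simulates the first recv on B, which triggers a nested request and blocks
// on its result. With numThreads == 1 this would hang, because the only
// worker is occupied by the outer task; with two or more it completes.
int handleNestedRequest(std::size_t numThreads) {
  assert(numThreads >= 2 && "a single worker would deadlock on the nested task");
  SimplePool pool(numThreads);
  auto outerDone = std::make_shared<std::promise<int>>();
  std::future<int> result = outerDone->get_future();

  pool.run([&pool, outerDone] {
    auto innerDone = std::make_shared<std::promise<int>>();
    std::future<int> inner = innerDone->get_future();
    // The "second recv": must run on another worker while this one waits.
    pool.run([innerDone] { innerDone->set_value(21); });
    outerDone->set_value(inner.get() * 2);
  });
  return result.get();
}
```

Calling handleNestedRequest(2) returns 42. Each level of nesting that blocks
holds one worker, so the pool size bounds the nesting depth, which is why
this is only a temporary solution for a small number of nested RPCs.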
ghstack-source-id: 88476246
Differential Revision: D16695091
fbshipit-source-id: fd18a5c65e7fcd1331b73d1287673e6e10d2dd86