Handle errors in ProcessGroupAgent::listenLoop() (#32957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32957
Closes https://github.com/pytorch/pytorch/issues/29703. If
`recvWork->wait()` hits a gloo timeout in `listenLoop()`, ProcessGroupAgent
crashes because the exception goes unhandled in the listener thread.
This change catches the exception and exits the listen loop. In a follow-up
diff, we will enhance these error conditions so that users who attempt to send
RPCs again are notified that the RPC agent was in a bad state and has been
shut down.
This PR also adds a new option, `processGroupTimeout`, to the PG agent's
backend options, which lets us control the gloo timeout.
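As a rough sketch of how a configurable timeout can bound a blocking wait, the snippet below uses only the C++ standard library. `BackendOptionsSketch` and `waitWithTimeout` are hypothetical names; only the `processGroupTimeout` field name comes from this PR, and its type and default here are assumptions. The real option is passed through to gloo rather than to `std::future`.

```cpp
#include <chrono>
#include <future>
#include <stdexcept>
#include <string>

// Hypothetical options struct; the type and default value are assumptions.
struct BackendOptionsSketch {
  std::chrono::milliseconds processGroupTimeout{1000};
};

// Waits for `work` for at most opts.processGroupTimeout and throws
// std::runtime_error if the result is not ready in time. A future obtained
// from std::promise is never deferred, so "not ready" means timeout here.
std::string waitWithTimeout(
    std::future<std::string>& work,
    const BackendOptionsSketch& opts) {
  if (work.wait_for(opts.processGroupTimeout) != std::future_status::ready) {
    throw std::runtime_error("wait timed out");
  }
  return work.get();
}
```

A shorter timeout surfaces a stuck peer sooner, at the cost of spurious failures on slow links. The exception thrown here is the kind of error the listen loop now has to catch.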
ghstack-source-id: 98236783
Test Plan: Added a unit test.
Differential Revision: D19678979
fbshipit-source-id: 3895ae754f407b84aca76c6ed3cb087d19178c40