[RPC Reliability] Implemented retries for RPCs with exponential backoff (#32602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32602
This adds functionality for re-trying RPC's that are sent with the function `sendWithRetries()`. It adds RPC's that will potentially need to be retried to a sorted map that contains the timeout at which to retry the RPC and associated metadata. A separate thread iteratively removes the earliest retry-able RPC from the map, sleeps until the corresponding time point, re-tries the RPC, and adds to the map again with a future timeout.
GitHub Issue: https://github.com/pytorch/pytorch/issues/32124
Per the first 3 milestones, the following will be addressed in future PR's:
* enabling RPC Retries for RRef internal messages
Differential Revision: D19560159
fbshipit-source-id: 40cd86f9a25dc24367624d279a3b9720b20824cf