[RPC Reliability] Enabled retries for RPCs with exponential backoff (#33365)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33365
This adds support for retrying RPCs that are sent with the function sendWithRetries(). RPCs that may need to be retried are added to a sorted map, keyed by the time point at which to retry the RPC and holding the associated metadata. A separate thread repeatedly removes the earliest retryable RPC from the map, sleeps until the corresponding time point, retries the RPC, and adds it to the map again with a later retry time.
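The retry loop described above can be sketched roughly as follows. This is an illustrative Python model, not the C++ code in this PR: the names RetryQueue, send_with_retries, make_flaky, and the parameters max_retries, initial_backoff, and backoff_base are all hypothetical, a heap stands in for the sorted map, and the retry thread's sleep is replaced by an injected sleep_until callback so the example runs deterministically. It also assumes an RPC is re-queued only when the retry fails and the retry limit has not been reached.

```python
import heapq
import itertools


def make_flaky(failures):
    """Hypothetical stand-in for an RPC: fails `failures` times, then succeeds."""
    calls = {"n": 0}

    def send():
        calls["n"] += 1
        return calls["n"] > failures  # True means the RPC succeeded

    return send


class RetryQueue:
    """Toy model of the retry map; a heap stands in for the sorted map."""

    def __init__(self, max_retries=3, initial_backoff=1.0, backoff_base=2.0):
        # All three parameters are assumptions made for this sketch.
        self.max_retries = max_retries
        self.initial_backoff = initial_backoff
        self.backoff_base = backoff_base
        self._heap = []  # entries: (retry_time, seq, [send, retry_count])
        self._seq = itertools.count()  # tie-breaker for equal retry times

    def _schedule(self, send, retry_count, now):
        # Exponential backoff: the delay grows by backoff_base per retry.
        delay = self.initial_backoff * self.backoff_base ** retry_count
        heapq.heappush(self._heap, (now + delay, next(self._seq), [send, retry_count]))

    def send_with_retries(self, send, now):
        """Try once; on failure, queue the RPC for a later retry."""
        if send():
            return True
        self._schedule(send, 0, now)
        return False

    def run(self, sleep_until):
        """Body of the retry thread: pop earliest, wait, retry, re-queue."""
        log = []
        while self._heap:
            retry_time, _, (send, retry_count) = heapq.heappop(self._heap)
            sleep_until(retry_time)  # a real thread would block here
            retry_count += 1
            ok = send()
            log.append((retry_time, ok))
            if not ok and retry_count < self.max_retries:
                self._schedule(send, retry_count, retry_time)
        return log


if __name__ == "__main__":
    queue = RetryQueue(max_retries=5)
    queue.send_with_retries(make_flaky(3), now=0.0)
    print(queue.run(lambda t: None))
```

With the default parameters assumed here, an RPC that keeps failing is retried 1, 3, and 7 time units after the first attempt, since each delay doubles the previous one.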
GitHub Issue: https://github.com/pytorch/pytorch/issues/32124
Per the first 4 milestones, the following will be addressed in future PRs:
* enabling RPC Retries for RRef internal messages
Differential Revision: D19915694
fbshipit-source-id: 4a520e32d5084ebcf90e97fd9f26867115a35c0c