Retry connecting to TCP store on ECONNRESET (#25707)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25707
The retry logic already handled ECONNREFUSED, which covers the client
being started before the server. It did not yet handle the case where
the server is running but its listen backlog is exhausted. This may
happen when starting many processes that all try to connect at the
same time.
The server implementation uses blocking I/O to read and write entire
messages, so it may take longer to call `accept(2)` on new connections
than a fully event-driven implementation would.
This commit both increases the default listen backlog on the server
side and implements retries on ECONNRESET after `connect(2)`.
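For illustration only, the client-side retry idea can be sketched as
below. This is not the code from this change: `connect_with_retry`,
its `timeout`/`delay` parameters and the injectable `connect` argument
are hypothetical names chosen for the example, and the fixed sleep
between attempts is an assumption.

```python
import errno
import socket
import time

# Errors treated as transient in this sketch:
#   ECONNREFUSED: the server is not listening yet.
#   ECONNRESET:   the server is up but dropped the connection, e.g.
#                 because its listen backlog was exhausted.
RETRYABLE_ERRNOS = {errno.ECONNREFUSED, errno.ECONNRESET}


def connect_with_retry(host, port, timeout=5.0, delay=0.05,
                       connect=socket.create_connection):
    """Call connect((host, port)) until it succeeds or `timeout` elapses.

    `connect` is injectable (hypothetical parameter) so the retry loop
    can be exercised without a real network.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return connect((host, port))
        except OSError as exc:
            # Re-raise anything that is not transient, or give up once
            # the deadline has passed.
            if exc.errno not in RETRYABLE_ERRNOS:
                raise
            if time.monotonic() >= deadline:
                raise
            time.sleep(delay)
```

On the server side, the matching change is a larger backlog argument
to `listen(2)`; the value to use is not stated here.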
Test Plan: Imported from OSS
Differential Revision: D17226958
Pulled By: pietern
fbshipit-source-id: 877a7758b29286e06039f31b5c900de094aa3100