pytorch
5a2f41a2 - [torch/distributed.elastic] Fix utils.distributed_test.test_create_store_timeout_on_server to be dual-stack ip compatible (#60558)

Commit
3 years ago
[torch/distributed.elastic] Fix utils.distributed_test.test_create_store_timeout_on_server to be dual-stack ip compatible (#60558)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60558

Fixes 1/2 flaky tests as described in: https://github.com/pytorch/pytorch/issues/60260

`test_create_store_timeout_on_server` tests whether trying to create a `c10d::TCPStore` server on an already taken port actually fails with an `IOError`.

Prior to this change, the `utils.get_socket_with_port()` util method was used to synthetically reserve a port, and the test then tried to create the `TCPStore` on that port to validate the `IOError`. The issue with this is that on a dual-stack IP setup, `get_socket_with_port()` (since it uses `socket.AF_UNSPEC`) reserves an IPv6 port, while `TCPStore` tries to bind to an IPv4 port, so an `IOError` is not observed.

This change reworks the test to create two `TCPStore` servers. The first chooses a free port (by passing `server_port=0`), while the second tries to create a `TCPStore` server on the port that the first store is already running on. This induces an `IOError` in the second store's constructor.

NOTE: this change does not solve another, broader issue with `TCPStore`, where the server and workers can listen and connect on IPv4 vs. IPv6 when they are running on dual-stack IP hosts without an IPv4 DNS entry and/or a `/etc/gai.conf` specifying the preferred bind ordering. See: https://github.com/pytorch/pytorch/pull/49124

Test Plan:
```
buck test //caffe2/test/distributed/elastic/utils:distributed_test
```

Reviewed By: cbalioglu

Differential Revision: D29334947

fbshipit-source-id: 76b998c59082cb04c0e86b7a1f3b509367fa0136
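The pattern the reworked test relies on (let the first server pick a free port, then start a second one on that same port and expect a failure) can be sketched with plain standard-library sockets. This is only an analogue, not the actual test code: it does not use `TCPStore`, and the helper names `bind_ipv4_listener` and `second_bind_conflicts` are hypothetical. Both listeners are pinned to `AF_INET` on the loopback address so that they contend for the same address family, which is the property the original `AF_UNSPEC`-based reservation did not guarantee.

```python
import socket


def bind_ipv4_listener(port: int) -> socket.socket:
    """Bind and listen on 127.0.0.1:<port>; port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen(1)
    except OSError:
        sock.close()
        raise
    return sock


def second_bind_conflicts() -> bool:
    """Return True if a second listener on the first listener's port is rejected."""
    first = bind_ipv4_listener(0)      # analogous to passing server_port=0
    port = first.getsockname()[1]      # the port the OS actually assigned
    try:
        second = bind_ipv4_listener(port)
    except OSError:
        return True                    # same family, same port: bind is refused
    else:
        second.close()
        return False
    finally:
        first.close()


if __name__ == "__main__":
    print("second bind rejected:", second_bind_conflicts())
```

Because both sockets use the same address family and neither sets `SO_REUSEADDR`, the second `bind` is expected to fail with an `OSError`, which mirrors the `IOError` the test expects from the second store's constructor.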
Author
Kiuk Chung