Fix Windows error in distributed (#60167)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60167
We were getting errors such as this on Windows in our c10d ProcessGroup test suite:
```
test_send_recv_all_to_all (__main__.ProcessGroupGlooTest) ... Exception in thread Thread-1:
Traceback (most recent call last):
  File "C:\Jenkins\Miniconda3\lib\threading.py", line 932, in _bootstrap_inner
    self.run()
  File "C:\Jenkins\Miniconda3\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_distributed.py", line 471, in _event_listener
    if pipe.poll(None):
  File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 257, in poll
    return self._poll(timeout)
  File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 330, in _poll
    return bool(wait([self], timeout))
  File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 883, in wait
    ov.cancel()
OSError: [WinError 6] The handle is invalid
Fatal Python error: could not acquire lock for <_io.BufferedWriter name='<stderr>'> at interpreter shutdown, possibly due to daemon threads
Python runtime state: finalizing (tstate=000001EFDF228CE0)
Thread 0x00001f68 (most recent call first):
  File "C:\Jenkins\Miniconda3\lib\threading.py", line 1202 in invoke_excepthook
  File "C:\Jenkins\Miniconda3\lib\threading.py", line 934 in _bootstrap_inner
  File "C:\Jenkins\Miniconda3\lib\threading.py", line 890 in _bootstrap
Current thread 0x00000f94 (most recent call first):
  <no Python frame>
FAIL (5.009s)
```
The process would then exit with error code 3221226505 (0xC0000409).
See: https://app.circleci.com/pipelines/github/pytorch/pytorch/337351/workflows/ad919a3e-fe9a-4566-8ad6-8b0a252f730c/jobs/14170191/steps
Looking at [the code of `_event_listener` in `common_distributed.py`](https://github.com/pytorch/pytorch/blob/eb36f67dcc7000caa58fa4dfa9c089ce17c6d523/torch/testing/_internal/common_distributed.py#L467-L489), I think the first exception (the one about the handle being invalid) is "expected": it results from another thread purposely closing the pipe while the event listener thread is polling it.
The relevant part of the problem seems to be the "could not acquire lock" error. I think this stems from the event listener thread being launched as a daemon thread, which means the interpreter does not wait for that thread to complete before shutting down. When the interpreter shuts down, it abruptly stops any daemon threads that are still running. If the event listener thread was stopped _while_ it was writing to stderr (as the traceback above suggests, since it was reporting the `OSError`), then it was holding the stream's lock and never got to release it. This is probably what the error is complaining about. This seems to be intended/expected behavior for CPython: https://bugs.python.org/issue42717.
The solution is thus simple: don't make that thread a daemon thread, and explicitly wait for it to terminate before shutting down.
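
As a rough illustration of that pattern (not the actual change to `common_distributed.py`), here is a minimal sketch of a non-daemon listener thread that is asked to stop and then joined before shutdown. The names `event_listener` and `run_listener_demo`, and the use of a `None` sentinel to stop the thread, are assumptions made for this example only:

```python
import threading
from multiprocessing import Pipe


def event_listener(conn, received):
    # Hypothetical listener: block until a message arrives on the pipe.
    # A None sentinel asks the thread to exit cleanly.
    while True:
        if conn.poll(None):
            msg = conn.recv()
            if msg is None:
                return
            received.append(msg)


def run_listener_demo():
    parent_conn, child_conn = Pipe()
    received = []
    # daemon=False: the interpreter will not tear this thread down
    # abruptly while it may be holding a stream lock.
    listener = threading.Thread(
        target=event_listener, args=(child_conn, received), daemon=False
    )
    listener.start()

    parent_conn.send("dump-traceback")
    parent_conn.send(None)  # ask the listener to stop

    # Explicitly wait for the thread to finish before closing the pipe
    # and before the interpreter starts shutting down.
    listener.join()
    parent_conn.close()
    child_conn.close()
    return listener.is_alive(), listener.daemon, received
```

Because the thread is joined before the pipe is closed, this sketch also avoids closing a handle that is still being polled; whether the real test harness can order things this way is not something this example settles.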
ghstack-source-id: 132293710
Test Plan: Will see...
Reviewed By: pritamdamania87
Differential Revision: D29193014
fbshipit-source-id: 4aabe1fc74bf9c54ca605e7a702ac99655489780