pytorch
f5675f83 - [torchelastic] Make sure torchelastic mp wait for queue to be drained before finishing the process (#55412)

Commit

3 years ago

[torchelastic] Make sure torchelastic mp wait for queue to be drained before finishing the process (#55412)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55412

This diff resolves a bug where worker processes could exit before the torchelastic process had read their return values. This is a rare event, but it can still happen, e.g. https://fb.workplace.com/groups/319878845696681/permalink/512409069776990/

When users want to return a torch.Tensor object from a worker process, torchelastic multiprocessing will fail. Currently a worker process finishes as soon as it writes its output to the IPC queue, without waiting for confirmation from the receiving process. When this happens, the underlying channel between the worker and the torchelastic process can be closed. In the case of mp.SimpleQueue the channel is a set of file descriptors, which is why we see FileNotFoundException: once the worker process has finished execution, its file descriptor is deleted and the torchelastic process cannot find it.

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test:local_agent_test

User workflow: f263531643

Reviewed By: cbalioglu

Differential Revision: D27602838

fbshipit-source-id: 29871178232e3af4ad3dec406c234aba9c5faba1
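The handshake described in the summary can be illustrated with a minimal sketch. This is not the code from the diff: the names `_worker`, `run_and_collect`, `result_queue`, and `finished_reading` are hypothetical, and the sketch assumes a POSIX system where the "fork" start method is available. The worker writes its return value to the queue and then blocks on an event, so it stays alive until the parent confirms that the value has been read.

```python
import multiprocessing as mp


def _worker(result_queue, finished_reading):
    # Hypothetical worker: publish a return value, then stay alive until
    # the parent signals that it has finished reading from the queue.
    result_queue.put({"rank": 0, "value": 42})
    finished_reading.wait()


def run_and_collect():
    # Assumption: "fork" is used so the sketch runs without a __main__ guard
    # (POSIX only). The same idea applies to other start methods.
    ctx = mp.get_context("fork")
    result_queue = ctx.SimpleQueue()
    finished_reading = ctx.Event()

    proc = ctx.Process(target=_worker, args=(result_queue, finished_reading))
    proc.start()

    result = result_queue.get()  # blocks until the worker has written
    finished_reading.set()       # only now is the worker allowed to exit
    proc.join()
    return result, proc.exitcode
```

Calling `run_and_collect()` returns the worker's value together with its exit code. Without the event, the worker could exit immediately after `put`, which is the window the commit message describes.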

Author

aivanou

Committer

facebook-github-bot

Parents

3bb1f59a
