ddp_experiments: improve error handling (#1264)
Summary:
Pull Request resolved: https://github.com/pytorch/benchmark/pull/1264
- Add a timeout on the model subprocess itself. This way the subprocess
gets killed and we don't time out on the outer barrier call.
- When waiting for the model subprocess, wait in a loop that alternates
between checking the queue and checking whether the process has
finished. This avoids waiting for the entire timeout when the worker
process fails, and also keeps the worker process from hanging when the
queue fills up and blocks.
Test Plan: Imported from OSS
Reviewed By: wconstab, aazzolini
Differential Revision: D40825608
Pulled By: davidberard98
fbshipit-source-id: d5c185a464e39b35b03b324a55b3beea022e6e84