benchmark
8fb03251 - ddp_experiments: improve error handling (#1264)

Commit
3 years ago
Summary:
Pull Request resolved: https://github.com/pytorch/benchmark/pull/1264

- Add a timeout on the model subprocess itself; this way, the subprocess gets killed and we won't time out on the outer barrier call.
- When waiting for the model subprocess, wait in a loop that alternates between checking the queue and checking whether the process has finished. This prevents us from waiting for the entire timeout when the worker process fails, and also prevents the worker process from hanging when the queue fills up and blocks.

Test Plan: Imported from OSS

Reviewed By: wconstab, aazzolini

Differential Revision: D40825608

Pulled By: davidberard98

fbshipit-source-id: d5c185a464e39b35b03b324a55b3beea022e6e84
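
The wait loop described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: `wait_for_result`, `timeout_s`, `poll_s`, and the returned status strings are hypothetical names, and the function only assumes that `proc` behaves like a `multiprocessing.Process` and `q` like a `multiprocessing.Queue`.

```python
import queue
import time


def wait_for_result(proc, q, timeout_s, poll_s=0.1):
    """Wait for one result from a worker process (hypothetical helper).

    Returns ("ok", value), ("failed", exitcode) or ("timeout", None).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # 1. Check the queue briefly. Draining it here also keeps a worker
        #    from blocking on a full queue while the parent is waiting.
        try:
            value = q.get(timeout=poll_s)
            proc.join(timeout=poll_s)
            return "ok", value
        except queue.Empty:
            pass
        # 2. Check whether the worker has already exited, so a crash is
        #    reported right away instead of after the full timeout.
        if not proc.is_alive():
            try:
                # The result may have arrived between the two checks.
                return "ok", q.get(timeout=poll_s)
            except queue.Empty:
                return "failed", proc.exitcode
    # 3. The worker overran its own timeout: kill it so that callers do not
    #    go on to time out on an outer barrier instead.
    proc.kill()
    proc.join()
    return "timeout", None
```

With a real worker, the call might look like `status, value = wait_for_result(proc, q, timeout_s=120)`, where `proc` was started with `multiprocessing.Process(target=..., args=(q,))`. The short `poll_s` interval is a trade-off: smaller values detect a dead worker sooner at the cost of more wake-ups.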