14f63763 - Avoid using mp.Manager to report #GPUs needed in dist tests (#61409)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61409

We used a multiprocessing.Manager to share TEST_SKIPS between the parent and the child processes. TEST_SKIPS is a global mapping that defines a unique error code for each "error type", so that the parent can figure out the reason a child exited. While originally this mapping was immutable, at some point we allowed children to modify the parent's copy of it so they could update the message for the `multi-gpu` error to reflect how many GPUs were really needed. This change was introduced in D23285790 (https://github.com/pytorch/pytorch/commit/2a4d312027f24898798e222b093e61a2427d5cee).

Since then, this Manager has proved quite problematic, especially around thread safety, races, TSAN, ... (see D22753459 (https://github.com/pytorch/pytorch/commit/f0c46878c6c79fc9ac452ee72559daf0bddeb074), D23641618 (https://github.com/pytorch/pytorch/commit/567c51cce9cab86772824a589816e1644169a630), D28490129, D28794321 (https://github.com/pytorch/pytorch/commit/0128eb9a85ce2214858c5ea92d3e9de328d38468) and D29585862). That is an awful lot of trouble for such a small feature. Here I propose we drop the Manager and achieve the same result by using a separate error code for each number of GPUs. This should be much simpler and thus more robust.

ghstack-source-id: 133236447

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D29612614

fbshipit-source-id: 8ad0fedcb7796e5832a0eb196f8fdc147e02b3df
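The approach can be sketched as follows. Instead of letting children mutate a shared TEST_SKIPS mapping through a Manager, each "needs N GPUs" condition gets its own fixed exit code, so the parent decodes the skip reason from the child's exit status alone, with no cross-process shared state. This is an illustrative sketch, not PyTorch's actual implementation; the names (TEST_SKIPS, skip_exit_code, child) and the specific code values are assumptions.

```python
# Sketch: one immutable exit code per "number of GPUs needed", so no
# Manager-backed shared dict is required. All names and code values here
# are hypothetical, chosen only to illustrate the technique.
import multiprocessing as mp
import sys

# Distinct, fixed exit codes for "this test needs N GPUs" (N = 2..4).
TEST_SKIPS = {f"multi-gpu-{n}": 80 + n for n in range(2, 5)}


def skip_exit_code(ngpus: int) -> int:
    """Return the exit code a child should use when it needs `ngpus` GPUs."""
    return TEST_SKIPS[f"multi-gpu-{ngpus}"]


def child(ngpus_needed: int) -> None:
    # A child that cannot run simply exits with the code for its GPU count;
    # it never needs to write back into the parent's TEST_SKIPS.
    sys.exit(skip_exit_code(ngpus_needed))


def main() -> None:
    p = mp.Process(target=child, args=(3,))
    p.start()
    p.join()
    # The parent recovers the skip reason purely from the exit code.
    for reason, code in TEST_SKIPS.items():
        if p.exitcode == code:
            print(f"child skipped: needs {reason}")


if __name__ == "__main__":
    main()
```

Because the mapping is never mutated, there is nothing to synchronize: the set of possible exit codes is fixed at import time in both parent and children.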
Author: lw