Avoid using mp.Manager to report #GPUs needed in dist tests (#61409)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61409
We used a multiprocessing.Manager to share TEST_SKIPS between the parent and the child processes. TEST_SKIPS is a global variable that assigns a unique error code to each "error type", so that the parent can figure out why a child exited. While this mapping was originally immutable, at some point we allowed children to modify the parent's copy of it so they could update the message for the `multi-gpu` error to reflect how many GPUs were actually needed. This happened in D23285790 (https://github.com/pytorch/pytorch/commit/2a4d312027f24898798e222b093e61a2427d5cee).

Since then, this Manager has proved quite problematic, especially around thread safety, races, and TSAN (see D22753459 (https://github.com/pytorch/pytorch/commit/f0c46878c6c79fc9ac452ee72559daf0bddeb074), D23641618 (https://github.com/pytorch/pytorch/commit/567c51cce9cab86772824a589816e1644169a630), D28490129, D28794321 (https://github.com/pytorch/pytorch/commit/0128eb9a85ce2214858c5ea92d3e9de328d38468) and D29585862). That is an awful lot of trouble for such a small feature. Here I propose we drop the Manager and achieve the same result by using a separate error code for each number of GPUs. This should be much simpler and thus more robust.
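A minimal sketch of the idea (names and exit-code values are illustrative, not the actual PyTorch code): instead of a single mutable `multi-gpu` entry that children rewrite through a Manager, register one immutable `multi-gpu-N` entry per GPU count, so a child can exit with the right code directly and the parent can decode it without any shared state.

```python
import sys
from collections import namedtuple

# Hypothetical mirror of the TEST_SKIPS mapping: one immutable entry per
# GPU count, so no cross-process mutation (and no mp.Manager) is needed.
TestSkip = namedtuple("TestSkip", ["exit_code", "message"])

TEST_SKIPS = {
    f"multi-gpu-{n}": TestSkip(80 + n, f"Need at least {n} CUDA devices")
    for n in range(2, 9)
}

def exit_if_lt_x_gpu(available_gpus, required_gpus):
    # Child side: exit with the code matching the required GPU count.
    if available_gpus < required_gpus:
        sys.exit(TEST_SKIPS[f"multi-gpu-{required_gpus}"].exit_code)

def explain_exit(exit_code):
    # Parent side: map a child's exit code back to a skip reason.
    for skip in TEST_SKIPS.values():
        if skip.exit_code == exit_code:
            return skip.message
    return None
```

Because each entry is created once and never modified, the parent and children can each hold their own plain copy of the dict, which removes the Manager and the associated thread-safety issues.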
ghstack-source-id: 133236447
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D29612614
fbshipit-source-id: 8ad0fedcb7796e5832a0eb196f8fdc147e02b3df