pytorch
ed3884c3 - Fix timeout with ZeRO test_step() and test_step_with_closure() (#59648)

Commit
3 years ago
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/59548

**Overview:**
This fixes the timeout issues with `test_step()` and `test_step_with_closure()` for the `ZeroRedundancyOptimizer`. The existing tests partially assumed a `world_size` of `2` (hence why [this](https://github.com/pytorch/pytorch/pull/59622) seems to be a temporary fix). This change instead should avoid baking in that assumption and allow `world_size` to be flexible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59648

Test Plan:
I tested with 2, 3, and 4 GPUs (and hence `world_size`s of 2, 3, and 4, respectively) via the AI AWS cluster.
```
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=2 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_step
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=3 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_step
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=4 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_step
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=2 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_step_with_closure
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=3 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_step_with_closure
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=4 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_step_with_closure
```

Reviewed By: jbschlosser

Differential Revision: D28975035

Pulled By: andwgu

fbshipit-source-id: 2cbaf6a35e22a95e19fc97e1b64e585e452e774e
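
The diff itself is not reproduced on this page, so the following is only a rough, framework-free sketch of what "baking in" a `world_size` of `2` can look like in a distributed test, and how an expected value can instead be derived from `world_size`. All names here (`rank_inputs`, `expected_averaged_grad`, `hardcoded_expected_grad`) and the per-rank input scheme are hypothetical illustrations, not code from the PyTorch test suite.

```python
def rank_inputs(world_size):
    # Hypothetical per-rank inputs: rank r contributes the value r + 1.
    return [float(rank + 1) for rank in range(world_size)]


def expected_averaged_grad(world_size):
    # Average over however many ranks exist instead of assuming two.
    inputs = rank_inputs(world_size)
    return sum(inputs) / world_size


def hardcoded_expected_grad():
    # Bakes in world_size == 2: (1 + 2) / 2. Only correct for two ranks.
    return 1.5


if __name__ == "__main__":
    for world_size in (2, 3, 4):
        flexible = expected_averaged_grad(world_size)
        matches = flexible == hardcoded_expected_grad()
        print(f"world_size={world_size} expected={flexible} "
              f"hardcoded_matches={matches}")
```

In this sketch the hard-coded expectation agrees with the derived one only when `world_size` is `2`, which is the kind of hidden assumption that can make a test pass on two GPUs and misbehave on three or four.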