Comment on tests' reliance on ZeRO's partitioning algo (#59713)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/59548
**Overview:**
Recently, we changed ZeRO's partitioning algorithm to first sort the parameters by decreasing size and then greedily assign each parameter to the shard with the smallest total size so far. See [here](https://github.com/pytorch/pytorch/commit/ea1de87f4b98d4b3d9c70fae66bdac3e9aa4f3b7).
The current tests `test_sharding()` and `test_add_param_group()` check for a uniform partitioning, which the old naive greedy partitioning algorithm does not produce for general world sizes but the new sorted-greedy algorithm does. This reliance is not ideal, but for now, we opt to simply add comments documenting the dependency.
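For intuition, here is a minimal sketch of the sorted-greedy strategy. The `partition()` helper is hypothetical, not the actual `ZeroRedundancyOptimizer` code:
```python
# Minimal sketch of sorted-greedy partitioning (hypothetical helper; the
# real logic lives in ZeroRedundancyOptimizer). Sort the parameter sizes
# in decreasing order, then assign each one to the shard with the
# smallest running total.
def partition(param_sizes, world_size):
    shards = [[] for _ in range(world_size)]
    totals = [0] * world_size
    for size in sorted(param_sizes, reverse=True):
        rank = totals.index(min(totals))  # shard with minimal total so far
        shards[rank].append(size)
        totals[rank] += size
    return shards, totals
```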
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59713
Test Plan:
I tested for world sizes of 1, 2, 3, and 4 via the AI AWS cluster:
```
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=1 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_sharding
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=2 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_sharding
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=3 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_sharding
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=4 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_sharding
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=1 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_add_param_group
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=2 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_add_param_group
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=3 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_add_param_group
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=4 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_add_param_group
```
However, because the train queue (which offers instances with 8 GPUs) is not working at the moment, I was unable to test for world sizes of 5+. Nonetheless, I believe the tests should still pass for those world sizes.
First, consider `test_sharding()`. Under the sorted-greedy algorithm, each shard is assigned one of the parameters of size `9`, then one of size `7`, then one of size `5`, and finally one of size `3`, so every shard ends with the same total size, i.e. a uniform partition.

Now consider `test_add_param_group()`. The same allocation behavior occurs, except that the last shard is initially not assigned the final parameter of size `3`. After the new `param_group` containing that size-`3` parameter is added, a re-partitioning occurs: the first `param_group` is partitioned as before, and the size-`3` parameter in the new `param_group` is assigned to the last shard, since that shard has the minimal total size. Thus, in the end, all shards again have a uniform partition.
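As a sanity check of the arithmetic above, here is a hypothetical walkthrough using the `partition()` sketch from the overview (the sizes match the tests, but this is not the tests' actual code):
```python
world_size = 4
sizes = [9, 7, 5, 3] * world_size

# test_sharding(): every shard receives one parameter of each size,
# so each shard totals 9 + 7 + 5 + 3 = 24.
_, totals = partition(sizes, world_size)
assert totals == [24] * world_size

# test_add_param_group(): the final size-3 parameter starts in its own
# param_group, so one shard is initially short.
_, totals = partition(sizes[:-1], world_size)
assert totals == [24, 24, 24, 21]
# After the new param_group is added, its size-3 parameter is assigned
# to the shard with minimal total (21), restoring uniform totals of 24.
```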
Reviewed By: mrshenli
Differential Revision: D28996460
Pulled By: andwgu
fbshipit-source-id: 22bdc638d8569ed9a20836812eac046d628d6df2