Clean Up ZeRO (#60285)
Summary:
**Overview:**
Being relatively new to PyTorch and ZeRO, I found parts of the code slightly hard to follow. This change strives to clean up the `ZeroRedundancyOptimizer` code in `zero_redundancy_optimizer.py` by reorganizing some computations, making variable names more explicit and consistent, and unifying terminology in the documentation. The goal is to make the code easier to extend afterwards.
**Changes:**
1) `state_dict()`: The [logic](https://github.com/pytorch/pytorch/blob/85517a2b700a5abc0b38f53ce8c99404cd67db79/torch/distributed/optim/zero_redundancy_optimizer.py#L510) for updating the global `state_dict` with each rank's local `state_dict` is simplified and made more explicit. Notably, the `dict` [`local_index_to_param_id`](https://github.com/pytorch/pytorch/blob/85517a2b700a5abc0b38f53ce8c99404cd67db79/torch/distributed/optim/zero_redundancy_optimizer.py#L513) is unneeded. It maps `local_pg["params"][i]` to `id(global_pg["params"][i])`, so the update can instead make a single pass over both lists in tandem, effectively iterating over `i`, with no need for the explicit `dict`.
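A minimal sketch of the simplification, using stand-in values rather than the actual ZeRO internals: since the local and global `"params"` lists are index-aligned, the explicit index-to-id `dict` can be replaced by one pass over both lists in tandem.

```python
# Stand-ins for one rank's local param group and the matching global group
# (same length, same order); not the actual ZeroRedundancyOptimizer state.
local_pg = {"params": ["lw0", "lw1", "lw2"]}
global_pg = {"params": ["gw0", "gw1", "gw2"]}

# Before: build a dict mapping local index i -> id(global param i), then
# consult it while looping over the local params.
local_index_to_param_id = {i: id(p) for i, p in enumerate(global_pg["params"])}
id_to_param = {id(p): p for p in global_pg["params"]}
pairs_via_dict = [
    (local_p, id_to_param[local_index_to_param_id[i]])
    for i, local_p in enumerate(local_pg["params"])
]

# After: a single pass over both lists in tandem yields the same pairing.
pairs_via_zip = list(zip(local_pg["params"], global_pg["params"]))
```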
2) `_update_trainable()`: The function [initializes](https://github.com/pytorch/pytorch/blob/85517a2b700a5abc0b38f53ce8c99404cd67db79/torch/distributed/optim/zero_redundancy_optimizer.py#L597) the local optimizer if it does not exist. I am unaware of any reason for the local optimizer to be destroyed after initialization, so I moved that logic to its own function `_init_local_optimizer()`, which is called once in the constructor.
After [discussion](https://github.com/pytorch/pytorch/pull/60285#discussion_r654706728), I removed the function `_update_trainable()` itself in favor of adding a check for `parameters_as_bucket_view` in `build_param_buckets()` directly.
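The refactor can be sketched as follows; the method names follow the PR, but the class bodies are illustrative and `LocalOptimizer` is a hypothetical stand-in for a real optimizer class such as `torch.optim.SGD`.

```python
class LocalOptimizer:  # stand-in for e.g. torch.optim.SGD
    def __init__(self, params):
        self.params = params

class Before:
    def __init__(self, params):
        self.params = params
        self.optim = None

    def _update_trainable(self):
        # Lazy: initializes the local optimizer only if it does not exist,
        # so callers must remember to invoke this before using self.optim.
        if self.optim is None:
            self.optim = LocalOptimizer(self.params)

class After:
    def __init__(self, params):
        self.params = params
        self._init_local_optimizer()  # called exactly once, up front

    def _init_local_optimizer(self):
        self.optim = LocalOptimizer(self.params)
```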
3) `rank_local_state_dict()`: This [function](https://github.com/pytorch/pytorch/blob/85517a2b700a5abc0b38f53ce8c99404cd67db79/torch/distributed/optim/zero_redundancy_optimizer.py#L528) is currently broken. It appears to be legacy and relies on the input `state_dict` to have the key `"partitions"`. For now, I have removed it and added an [issue](https://github.com/pytorch/pytorch/issues/60284). Is it a notable use case to want to access another rank's `state_dict` in particular (as opposed to consolidating the entire state and then accessing)?
4) `local_state_dict()`: After [discussion](https://github.com/pytorch/pytorch/pull/60285#discussion_r655571043), I removed the function.
5) `partition_parameters()`: After [discussion](https://github.com/pytorch/pytorch/pull/60285#discussion_r654708183), I renamed the function to `_partition_parameters()` to mark it as private.
6) `_param_to_index`: After [discussion](https://github.com/pytorch/pytorch/pull/60285#discussion_r654828100), I changed the key to be the parameter itself rather than its integer ID.
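A small sketch of why keying by the parameter works: torch parameters hash by object identity, so a `dict` keyed by the parameter itself behaves like one keyed by `id(param)` while avoiding the indirection. Here `Param` is a hypothetical stand-in for `torch.nn.Parameter` (plain Python objects also hash by identity by default).

```python
class Param:
    pass  # default object hashing is identity-based, like torch tensors

params = [Param(), Param(), Param()]

# Before: keys are the integer ids of the parameters.
param_id_to_index = {id(p): i for i, p in enumerate(params)}

# After: keys are the parameters themselves; lookups read more naturally.
param_to_index = {p: i for i, p in enumerate(params)}
```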
7) `buckets`: I renamed the data structure to `_buckets` to mark it as private.
8) Terminology: I tried to reduce the number of distinct terms in use rather than juggling several synonyms. In particular, I made an effort to distinguish between "local" and "global" and to make names more indicative of their types.
9) Style: Per the [PyTorch contributing guide](https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md#writing-documentation), I made all docstrings abide by the 80-character limit, except for the one [line](https://github.com/andwgu/pytorch/blob/554891f6faa764c76dec4afb1107cb5aa88ef589/torch/distributed/optim/zero_redundancy_optimizer.py#L142) showing the example ZeRO usage. Some code lines violate the limit for readability. Also, I unified some minor stylistic conventions out of habit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60285
Test Plan:
The test suite passes as expected (on the AI AWS cluster):
```
gpurun python test/distributed/optim/test_zero_redundancy_optimizer.py
```
I visually inspected the generated HTML doc (as generated following [this](https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md#writing-documentation)).
Reviewed By: mrshenli
Differential Revision: D29320726
Pulled By: andwgu
fbshipit-source-id: 23f69a19ecc5e877a38fe1df0da11329428311dd