ea1de87f - Sort params by size (decreasing)

Sort params by size (decreasing)

Summary:

Pull Request: https://github.com/pytorch/pytorch/pull/59586
Task: https://www.internalfb.com/tasks/?t=90847711

**Overview:**

Suppose we have `n` items with positive integer sizes and `k` buckets. We want to assign items to buckets with the goal of uniformity. The precise criterion for uniformity can vary: e.g. minimize the maximum size, maximize the minimum size, etc. This is known as [multiway number partitioning](https://en.wikipedia.org/wiki/Multiway_number_partitioning). ZeRO's partitioning task reduces to solving this problem. In particular, this is the subproblem to be solved for each `param_group` in `self.param_groups`, where the parameters are the items and the ranks give the buckets.

The existing implementation uses the linear-time [greedy number partitioning algorithm](https://en.wikipedia.org/wiki/Greedy_number_partitioning#Linear-time_algorithm), which assigns the next tensor parameter to the process with the smallest total parameter size so far. In this task, I explore the [extension](https://en.wikipedia.org/wiki/Greedy_number_partitioning#Improved_algorithm) where the parameters in each group are first sorted by decreasing size before applying the greedy algorithm, which requires linearithmic time (dominated by the sort).

**Experiments**

The mean number of parameters represents a perfectly uniform allocation and hence the ideal allocation (which may be better than even the optimal achievable partition). In the following tables, I present the maximum number of parameters assigned to any one process, with the difference from the mean in parentheses, for ResNet-50, ResNet-152, and BERT (the bare BERT model). The best-performing partitioning strategy for each model is bolded.

Two processes:

| Model | Max Num Params - Greedy (Diff) | Max Num Params - Greedy-Sorted (Diff) | Mean Num Params |
| --- | --- | --- | --- |
| ResNet-50 | 13,249,600 (471,084) | **12,794,816 (16,300)** | 12,778,516 |
| ResNet-152 | 30,567,488 (471,084) | **30,111,424 (15,020)** | 30,096,404 |
| BERT | **54,749,184 (8,064)** | 55,327,488 (586,368) | 54,741,120 |

Four processes:

| Model | Max Num Params - Greedy (Diff) | Max Num Params - Greedy-Sorted (Diff) | Mean Num Params |
| --- | --- | --- | --- |
| ResNet-50 | 7,524,864 (1,135,606) | **6,436,864 (47,606)** | 6,389,258 |
| ResNet-152 | 16,232,192 (1,183,990) | **15,090,152 (41,950)** | 15,048,202 |
| BERT | **28,151,040 (780,480)** | 28,352,256 (981,696) | 27,370,560 |
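For concreteness, here is a minimal standalone sketch of the two assignment strategies compared above. The function name and the plain list-of-sizes input are illustrative only (the actual `partition_parameters()` operates on tensors per `param_group`); the point is simply that the sorted variant assigns the largest items first.

```python
from typing import List, Tuple

def greedy_partition(sizes: List[int], world_size: int,
                     sort_decreasing: bool = False) -> Tuple[List[List[int]], List[int]]:
    """Assign each item to the rank with the smallest running total so far.

    `sizes[i]` plays the role of `param.numel()`. Returns the item indices
    assigned to each rank and the per-rank size totals.
    """
    order = list(range(len(sizes)))
    if sort_decreasing:
        order.sort(key=lambda i: sizes[i], reverse=True)  # largest items first
    buckets: List[List[int]] = [[] for _ in range(world_size)]
    totals: List[int] = [0] * world_size
    for i in order:
        rank = min(range(world_size), key=lambda r: totals[r])  # least-loaded rank
        buckets[rank].append(i)
        totals[rank] += sizes[i]
    return buckets, totals

# Toy example: four parameters in "model order", two ranks.
sizes = [20, 30, 90, 100]
print(greedy_partition(sizes, 2))                        # totals [110, 130]
print(greedy_partition(sizes, 2, sort_decreasing=True))  # totals [120, 120]
```

On this toy input, plain greedy ends with per-rank totals of 110 and 130, while the sorted variant balances them at 120 each, mirroring the smaller max-minus-mean gaps seen in the tables above.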
---

I also investigated the latency of `optimizer.step()` for the different partitioning algorithms. I measured the latency for 30 iterations and took the mean latency per process (excluding the first iteration due to cache coldness). In the following tables, I present the maximum of those mean latencies over all processes and, in parentheses, the standard deviation of the latencies contributing to that maximum. Again, the best-performing partitioning strategy for each model is bolded. All entries are in seconds and used the `gloo` backend.

Two processes:

| Model | Max `optimizer.step()` Time - Greedy (Std.) | Max `optimizer.step()` Time - Greedy-Sorted (Std.) |
| --- | --- | --- |
| ResNet-50 | **0.060 (0.002)** | 0.061 (0.002) |
| ResNet-152 | 0.166 (0.003) | **0.160 (0.004)** |
| BERT | 0.220 (0.009) | **0.199 (0.006)** |

Four processes:

| Model | Max `optimizer.step()` Time - Greedy (Std.) | Max `optimizer.step()` Time - Greedy-Sorted (Std.) |
| --- | --- | --- |
| ResNet-50 | 0.094 (0.004) | **0.093 (0.004)** |
| ResNet-152 | **0.228 (0.011)** | 0.231 (0.009) |
| BERT | **0.328 (0.015)** | 0.329 (0.021) |

Based on the standard deviations, the differences in the latency measurements across the algorithms are within the uncertainty of the measurement itself. Hence, it is difficult to argue that one algorithm is clearly the fastest.

---

`zero.py` is my experiment script, and I use the AI AWS cluster. The run command looks like:

```
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=4 python zero.py -b nccl greedy 2 4
```

This runs the experiment script on an instance with 4 GPUs using the `nccl` backend, outputting to a directory named `greedy/` and using world sizes of 2 and 4. An analogous command can be used after modifying `partition_parameters()`, e.g. replacing `greedy` with `greedy_sorted` as the output directory name. Then, to run the analysis script:

```
python analyze.py greedy greedy_sorted
```

For more details on the experiment code, refer to: https://www.internalfb.com/diff/D28946756
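Since the experiment diff is internal, here is a rough sketch of the kind of per-iteration timing loop described above. Everything in it (the toy model, batch shapes, SGD hyperparameters, launching via `torchrun`) is a placeholder assumption rather than the actual `zero.py`:

```python
import time
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer

def step_latencies(model: torch.nn.Module, inputs: torch.Tensor,
                   targets: torch.Tensor, num_iters: int = 30) -> list:
    """Run `num_iters` training iterations and return the per-iteration
    `optimizer.step()` latencies in seconds."""
    optimizer = ZeroRedundancyOptimizer(
        model.parameters(),
        optimizer_class=torch.optim.SGD,  # local optimizer wrapped by ZeRO
        lr=0.01,
    )
    loss_fn = torch.nn.CrossEntropyLoss()
    latencies = []
    for _ in range(num_iters):
        optimizer.zero_grad()
        loss_fn(model(inputs), targets).backward()
        start = time.perf_counter()
        optimizer.step()
        latencies.append(time.perf_counter() - start)
    return latencies

if __name__ == "__main__":
    # Launched with e.g. `torchrun --nproc_per_node=2 this_script.py`.
    dist.init_process_group("gloo")
    model = torch.nn.Linear(1000, 10)
    inputs, targets = torch.randn(64, 1000), torch.randint(0, 10, (64,))
    warm = step_latencies(model, inputs, targets)[1:]  # drop the cold first iteration
    print(f"rank {dist.get_rank()}: mean step time {sum(warm) / len(warm):.4f} s")
    dist.destroy_process_group()
```

The reported numbers then correspond to taking each rank's mean over the warm iterations and, across ranks, the maximum of those means.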
**Notes:**

There exists an optimal solution to this partitioning problem. An algorithm that finds such a solution is the [complete greedy algorithm (CGA)](https://en.wikipedia.org/wiki/Greedy_number_partitioning#An_exact_algorithm), which reduces to a brute-force combinatorial search in the worst case. There exist heuristics that improve the `k = 2` case (i.e. when there are two processes); however, given that `n` in typical use cases is very large, any algorithm that is quadratic or slower is unrealistic. Other exact algorithms are similarly exponential in the worst case, rendering them intractable. Given this, I do not currently see a need for future-proofing the partitioning algorithm against the introduction of algorithms beyond the naive greedy and the sorted greedy algorithms.

---

In the current ZeRO implementation, the core `partition_parameters()` computation happens twice upon initialization (i.e. during the call to `__init__()`): first from a call to `_param_to_rank()` (i.e. an access to `_param_to_rank`) and then from a call to `_update_trainable()`. `_update_trainable()` sees that no optimizer has been constructed yet, so it clears the cache, discarding the result of the first `partition_parameters()` computation and performing a redundant re-computation.

Here is a typical trace:

- [The ZeRO optimizer object is initialized, calling `__init__()`.](https://github.com/pytorch/pytorch/blob/d125694d0bc4e02de9a54ce485b31ca333559203/torch/distributed/optim/zero_redundancy_optimizer.py#L142)
- [In `__init__()`, `self._device` is set, so it accesses `self._per_device_params`.](https://github.com/pytorch/pytorch/blob/d125694d0bc4e02de9a54ce485b31ca333559203/torch/distributed/optim/zero_redundancy_optimizer.py#L182)
- [`self._per_device_params` is not cached, so it accesses `self._param_to_rank`.](https://github.com/pytorch/pytorch/blob/d125694d0bc4e02de9a54ce485b31ca333559203/torch/distributed/optim/zero_redundancy_optimizer.py#L340)
- [`self._param_to_rank` is not cached, so it calls `partition_parameters()`.](https://github.com/pytorch/pytorch/blob/d125694d0bc4e02de9a54ce485b31ca333559203/torch/distributed/optim/zero_redundancy_optimizer.py#L353) (first call to `partition_parameters()`)
- [`__init__()` later calls `_update_trainable()`.](https://github.com/pytorch/pytorch/blob/d125694d0bc4e02de9a54ce485b31ca333559203/torch/distributed/optim/zero_redundancy_optimizer.py#L185)
- [In `_update_trainable()`, `self` does not yet have the attribute `"optim"`, so it clears the cached objects (notably, `self._partition_parameters_cache`).](https://github.com/pytorch/pytorch/blob/d125694d0bc4e02de9a54ce485b31ca333559203/torch/distributed/optim/zero_redundancy_optimizer.py#L591)
- [`_update_trainable()` calls `self.partition_parameters()`.](https://github.com/pytorch/pytorch/blob/d125694d0bc4e02de9a54ce485b31ca333559203/torch/distributed/optim/zero_redundancy_optimizer.py#L593) (second call to `partition_parameters()`)

Based on the discussion [here](https://github.com/pytorch/pytorch/pull/59410), this recomputation is unintentional and should be addressed in a future diff.
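As a toy illustration of this lazy-caching pattern, the following sketch uses hypothetical class and attribute names (it is not the actual `ZeroRedundancyOptimizer` code) to show why the partition is computed twice during construction:

```python
class ToyZero:
    """Hypothetical sketch of the double computation described above."""

    def __init__(self, params):
        self._params = list(params)
        self._partition_cache = None
        _ = self._param_to_rank        # lazily triggers the 1st partition_parameters()
        self._update_trainable()       # clears the cache, triggering the 2nd

    def partition_parameters(self):
        print("partition_parameters() running")   # printed twice per __init__
        return {id(p): 0 for p in self._params}   # stand-in for the real greedy assignment

    @property
    def _param_to_rank(self):
        if self._partition_cache is None:
            self._partition_cache = self.partition_parameters()
        return self._partition_cache

    def _update_trainable(self):
        if not hasattr(self, "optim"):
            self._partition_cache = None   # cache cleared before the optimizer exists...
        _ = self._param_to_rank            # ...so the partition is recomputed here

ToyZero([object(), object()])  # prints "partition_parameters() running" twice
```

Keeping the first cached result (or deferring the first access) would remove the redundant pass, which is what the future diff mentioned above would address.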
Test Plan: I verified that the total number of parameters across the processes was consistent after the partitioning algorithm change. Otherwise, no additional modifications were made to existing tests.

Reviewed By: mrshenli

Differential Revision: D28946755

fbshipit-source-id: 7ad66a21a963555b3b2e693ba8069d2dddc94c60