[FSDP] Ensure that all ranks use the same order to iterate through optimizer states (#84654)
**Background:**
Optimizer states have the type `Dict[int, Dict[str, torch.Tensor]]`, and the order of `dict.items()` is the key creation order. Without checkpointing (state_dict/load_state_dict), the key creation order depends on the optimizer implementation (e.g., Adam seems to create `exp_avg` and then `exp_avg_sq`). However, when loading states from a checkpoint, since the optimizer states are lazily initialized, the order depends on the user code that reads the state_dict from IO. See the following example:
```python
optimizer_state_dict = USER_CODE_TO_READ_STATE_FROM_IO()
optimizer.load_state_dict(optimizer_state_dict)
```
The key order of `optimizer_state_dict` depends on `USER_CODE_TO_READ_STATE_FROM_IO` and there is no guarantee that the order is the same across ranks.
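As a minimal illustration of the underlying Python behavior (the dict contents below are placeholders, not real optimizer tensors): two ranks can hold logically identical state dicts yet iterate them in different orders, because `dict.items()` follows insertion order rather than any canonical order.

```python
# Two ranks build equal state dicts, but their checkpoint-reading code
# happens to insert the keys in a different order.
state_rank0 = {"exp_avg": 1, "exp_avg_sq": 2}
state_rank1 = {"exp_avg_sq": 2, "exp_avg": 1}

# The dicts compare equal...
assert state_rank0 == state_rank1
# ...but iteration order diverges, so positional collectives such as
# all_gather() would pair up mismatched entries across ranks.
print(list(state_rank0))  # ['exp_avg', 'exp_avg_sq']
print(list(state_rank1))  # ['exp_avg_sq', 'exp_avg']
```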
**What Can Go Wrong?**
After the first checkpoint load, the key order of the optimizer states may differ across ranks. When users save another checkpoint, `_unflatten_optim_state()` is called to save the optimizer states. Inside `_unflatten_optim_state()`, `dict.items()` is called to iterate over all the local optimizer states, and `all_gather()` is used to gather the local states. Since the iteration order may differ across ranks, the gathered states are incorrect.
We have seen some models get NaN loss after the second checkpoint load because of this issue.
**What This PR Does?**
This PR implements a `sorted_items()` helper that returns `(key, value)` pairs sorted by key, so every rank iterates the states in the same order. Sorting is possible because each key is either an integer or a string.
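A minimal sketch of such a helper (the signature here is illustrative, not the exact one from the PR):

```python
from typing import Any, Dict, Iterator, Tuple, Union

def sorted_items(
    d: Dict[Union[int, str], Any],
) -> Iterator[Tuple[Union[int, str], Any]]:
    """Yield (key, value) pairs in sorted key order.

    Within one state dict the keys are all ints (parameter IDs) or all
    strs (state names), so sorted() gives every rank the same order
    regardless of each dict's insertion order.
    """
    for key in sorted(d.keys()):
        yield key, d[key]
```

Replacing the `dict.items()` calls in the gather path with this helper makes the iteration order deterministic across ranks, so the `all_gather()`ed states line up correctly.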
Differential Revision: [D39315184](https://our.internmc.facebook.com/intern/diff/D39315184/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84654
Approved by: https://github.com/awgu