Fix loading a universal checkpoint (#5263)
This PR fixes the following two issues with checkpoint loading.
- Load optimizer states
With [this PR](https://github.com/microsoft/DeepSpeed/pull/5104), we
removed the optimizer's `step()` call on initialization. This made DeepSpeed's
parameter updates match PyTorch's normal behavior. However, it also means
the optimizer state no longer has any keys when we load a checkpoint.
For legacy/elastic checkpoints, that PR changed the checkpoint loaders to
create the keys and buffers on loading. However, the loader for universal
checkpoints still relies on the keys already being present in the optimizer
state. As a result, loading a universal checkpoint fails.
This PR fixes the loader to find the optimizer state keys from the given
checkpoint instead, as sketched below.
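For illustration, a minimal sketch of the idea, assuming a hypothetical per-parameter checkpoint layout where each optimizer state tensor is stored in its own file (the directory structure and names below are not DeepSpeed's actual layout or API):

```python
import os
import torch

def load_universal_optimizer_state(optimizer, ckpt_dir):
    """Populate optimizer.state from a checkpoint, deriving the state keys
    (e.g. exp_avg, exp_avg_sq) from what is stored in the checkpoint rather
    than from optimizer.state, which is empty before the first step()."""
    for group in optimizer.param_groups:
        for param in group["params"]:
            # Hypothetical layout: <ckpt_dir>/<param_name>/<state_key>.pt
            param_dir = os.path.join(ckpt_dir, getattr(param, "ds_name", "param"))
            for fname in os.listdir(param_dir):
                key = os.path.splitext(fname)[0]
                # Create the key and buffer on load instead of assuming it exists.
                optimizer.state[param][key] = torch.load(os.path.join(param_dir, fname))
```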
- Resume step count
https://github.com/microsoft/DeepSpeed/pull/5263/commits/2943e6ab7e156946a018ab2a08c7f3ba45b55e01
The universal checkpoint loader resumes the optimizer's step count only when
the param group already has a `step` key. But some optimizers create the
`step` key in a param group at the first call of `step()` (e.g. Apex
[FusedAdam](https://github.com/NVIDIA/apex/blob/810ffae374a2b9cb4b5c5e28eaeca7d7998fca0c/apex/optimizers/fused_adam.py#L154)).
In this case, the step count is not restored. This PR changes this
behavior to always set the step count in the param group (see the sketch
below).
This PR also stops incrementing the step count on loading. In my small
example I didn't see why the step count needs to be incremented, but we may
need a discussion to consider various cases.
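A minimal sketch of the behavioral change for the step count (the function and variable names are illustrative, not DeepSpeed's actual loader code):

```python
def restore_step_count(param_group, checkpoint_step):
    # Old behavior: restore only if the optimizer had already created the key,
    # and increment on load. Optimizers like Apex FusedAdam only create `step`
    # at the first step() call, so the count was silently lost.
    #
    #     if "step" in param_group:
    #         param_group["step"] = checkpoint_step + 1
    #
    # New behavior: always set the step count, without incrementing.
    param_group["step"] = checkpoint_step
```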