DeepSpeed
0a10bd42 - Fix alignment of optimizer states when loading (#5105)

Commit
1 year ago
The ZeRO 1/2 optimizer pads optimizer states to match NCCL's alignment. When loading from an elastic checkpoint, however, it did not account for that alignment, so optimizer states were restored incorrectly. The existing test case verified only parameter groups and therefore did not catch this issue. This PR fixes the misalignment and extends the unit test to check that optimizer state tensors match after restoration.
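To illustrate the kind of mismatch described above, here is a minimal sketch in plain Python. It is not DeepSpeed's code: the helper names (`aligned_size`, `pad_flat_state`, `partition`, `reshard`) are hypothetical, and the assumption that the padding target is an alignment factor multiplied by the data-parallel world size is made only for this example. The point is that a flattened state saved under one world size must have its old padding stripped and then be re-padded for the new world size before it is split into shards.

```python
# Hypothetical helpers for illustration only; none of these are DeepSpeed APIs.

def aligned_size(numel, alignment):
    """Round numel up to the next multiple of alignment."""
    remainder = numel % alignment
    return numel if remainder == 0 else numel + alignment - remainder


def pad_flat_state(flat, alignment):
    """Append zeros so the flattened state length is a multiple of alignment."""
    return flat + [0.0] * (aligned_size(len(flat), alignment) - len(flat))


def partition(flat, world_size):
    """Split an already padded flat buffer into equal per-rank shards."""
    if len(flat) % world_size != 0:
        raise ValueError("buffer length must be divisible by world_size")
    shard = len(flat) // world_size
    return [flat[r * shard:(r + 1) * shard] for r in range(world_size)]


def reshard(saved_shards, true_numel, new_world_size, alignment_factor):
    """Restore saved shards under a different world size.

    Assumption for this sketch: alignment = alignment_factor * world_size.
    """
    merged = [x for shard in saved_shards for x in shard]
    unpadded = merged[:true_numel]  # drop the padding added at save time
    alignment = alignment_factor * new_world_size
    return partition(pad_flat_state(unpadded, alignment), new_world_size)


if __name__ == "__main__":
    state = [float(i) for i in range(10)]  # 10 real optimizer-state values
    # Saved with 2 ranks and an alignment factor of 2: padded to 12 elements.
    saved = partition(pad_flat_state(state, 2 * 2), 2)
    # Loaded with 4 ranks: re-padded to 16 elements, 4 per rank.
    restored = reshard(saved, len(state), new_world_size=4, alignment_factor=2)
    for rank, shard in enumerate(restored):
        print(rank, shard)
```

Splitting the saved 12-element buffer directly across 4 ranks would give shards of 3 elements, while a freshly initialized optimizer under the same assumptions would expect shards of 4. A test that compares only parameter groups would not notice that difference, which is why comparing the state tensors themselves is the more direct check.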