DeepSpeed
f2bb1ec6 - Add Feature Universal Checkpoint for AutoTP (#7908)

Commit
12 days ago
Hi DeepSpeed team, thanks for your time reviewing this PR.

## Summary

Add Universal Checkpoint (UC) metadata support for DeepSpeed AutoTP to enable saving and resuming from Universal Checkpoints.

## Motivation

AutoTP partitions parameters across TP ranks. To make checkpoints portable and restorable, we need a stable UC metadata representation that can be collected at save time and consumed at restore time.

## What’s in this PR

- Collect AutoTP-specific Universal Checkpoint metadata for TP-partitioned parameters.
- Provide restore/merge helpers that normalize shapes and correctly interpret the saved conversion/partition view.
- Keep existing (non-AutoTP / non-UC) checkpoint paths unchanged (no behavior change expected for other users).

## Testing

- `pytest -q tests/unit/runtime/tensor_parallel/test_autotp_universal_checkpoint.py`
- `pytest -q tests/unit/checkpoint/test_autotp_universal_checkpoint.py`

## Request for feedback

Could you please take a look at the UC metadata schema and let me know if you’d prefer any changes to naming, field placement, or compatibility expectations? I’m happy to iterate quickly based on your guidance.

## References

- Refs: #7861 (Q2 2026 roadmap: AutoTP Universal Checkpoint support)

---------

Signed-off-by: nathon-lee <leejianwoo@gmail.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: nathon-lee <248585198+nathon-lee@users.noreply.github.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
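## Illustrative sketch

The idea of collecting partition metadata at save time and consuming it at restore time can be sketched in plain Python. This is only a toy illustration, not the schema or helpers added by this PR: `ShardMeta`, `split`, and `merge` are hypothetical names, parameters are modeled as 2-D nested lists, and the partitioned dimension is assumed to divide evenly by the TP degree.

```python
from dataclasses import dataclass
from typing import List, Tuple

Matrix = List[List[float]]


@dataclass
class ShardMeta:
    """Hypothetical per-parameter metadata; NOT DeepSpeed's actual UC schema."""
    name: str
    partition_dim: int            # 0 = rows split across TP ranks, 1 = columns
    tp_degree: int                # number of TP ranks at save time
    full_shape: Tuple[int, int]   # shape of the unpartitioned parameter


def split(full: Matrix, meta: ShardMeta) -> List[Matrix]:
    """Partition a full 2-D parameter evenly across TP ranks (save-time view)."""
    rows, cols = meta.full_shape
    n = meta.tp_degree
    if meta.partition_dim == 0:
        step = rows // n
        return [full[r * step:(r + 1) * step] for r in range(n)]
    step = cols // n
    return [[row[r * step:(r + 1) * step] for row in full] for r in range(n)]


def merge(shards: List[Matrix], meta: ShardMeta) -> Matrix:
    """Rebuild the full parameter from per-rank shards (restore-time view)."""
    if len(shards) != meta.tp_degree:
        raise ValueError(f"{meta.name}: expected {meta.tp_degree} shards, got {len(shards)}")
    if meta.partition_dim == 0:
        # Row-partitioned: stack the shards' rows in rank order.
        merged = [row for shard in shards for row in shard]
    else:
        # Column-partitioned: concatenate matching rows from each rank.
        merged = [sum((shard[i] for shard in shards), []) for i in range(len(shards[0]))]
    # Check the result against the shape recorded at save time.
    if (len(merged), len(merged[0])) != tuple(meta.full_shape):
        raise ValueError(f"{meta.name}: merged shape does not match {meta.full_shape}")
    return merged


if __name__ == "__main__":
    weight = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
    meta = ShardMeta(name="toy.weight", partition_dim=1, tp_degree=2, full_shape=(2, 4))
    shards = split(weight, meta)
    assert merge(shards, meta) == weight
```

Recording the partition dimension and full shape alongside each shard is what lets a restore step rebuild the parameter without knowing how the saving job was configured.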