Add Universal Checkpoint support for AutoTP (#7908)
Hi DeepSpeed team — thanks for your time reviewing this PR.
## Summary
Add Universal Checkpoint (UC) metadata support for DeepSpeed AutoTP so
that AutoTP runs can save Universal Checkpoints and resume from them.
## Motivation
AutoTP partitions parameters across TP ranks. To make checkpoints
portable and restorable, we need a stable UC metadata representation
that can be collected at save time and consumed at restore time.
## What’s in this PR
- Collect AutoTP-specific Universal Checkpoint metadata for
TP-partitioned parameters.
- Provide restore/merge helpers that normalize shapes and correctly
interpret the saved conversion/partition view.
- Keep existing (non-AutoTP / non-UC) checkpoint paths unchanged (no
behavior change expected for other users).
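
For reviewers who want a concrete picture of the save/restore round trip described above, here is a minimal sketch in plain Python. It is not the code in this PR: `TPParamMeta`, `shard`, `merge`, and every field name are hypothetical, tensors are modeled as nested lists, and the partitioned dimension is assumed to divide evenly by the TP size.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Matrix = List[List[float]]


@dataclass
class TPParamMeta:
    """Hypothetical per-parameter record; field names are illustrative only."""
    full_shape: Tuple[int, int]   # shape of the unpartitioned parameter
    partition_dim: Optional[int]  # 0 = rows, 1 = columns, None = replicated
    tp_size: int                  # number of TP ranks at save time


def shard(full: Matrix, meta: TPParamMeta, rank: int) -> Matrix:
    """Return the slice of `full` that a given TP rank would hold."""
    if meta.partition_dim is None:
        return [row[:] for row in full]
    rows, cols = meta.full_shape
    if meta.partition_dim == 0:
        step = rows // meta.tp_size  # assumes even divisibility
        return [row[:] for row in full[rank * step:(rank + 1) * step]]
    step = cols // meta.tp_size      # assumes even divisibility
    return [row[rank * step:(rank + 1) * step] for row in full]


def merge(shards: List[Matrix], meta: TPParamMeta) -> Matrix:
    """Rebuild the full parameter from per-rank shards using the metadata."""
    if len(shards) != meta.tp_size:
        raise ValueError("expected one shard per TP rank")
    if meta.partition_dim is None:
        merged = [row[:] for row in shards[0]]       # replicated: any copy works
    elif meta.partition_dim == 0:
        merged = [row[:] for s in shards for row in s]  # stack rows
    else:
        # Concatenate the i-th row of every shard along the column axis.
        merged = [sum((s[i] for s in shards), []) for i in range(len(shards[0]))]
    rows, cols = meta.full_shape
    if len(merged) != rows or any(len(r) != cols for r in merged):
        raise ValueError("merged shape does not match recorded full_shape")
    return merged


full = [[1, 2, 3, 4], [5, 6, 7, 8]]
col_meta = TPParamMeta(full_shape=(2, 4), partition_dim=1, tp_size=2)
col_shards = [shard(full, col_meta, r) for r in range(col_meta.tp_size)]
restored = merge(col_shards, col_meta)
```

The point of the sketch is only that the partition dimension and full shape recorded at save time are enough to validate and reassemble the shards at restore time; the actual schema in this PR may differ.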
## Testing
- `pytest -q tests/unit/runtime/tensor_parallel/test_autotp_universal_checkpoint.py`
- `pytest -q tests/unit/checkpoint/test_autotp_universal_checkpoint.py`
## Request for feedback
Could you please take a look at the UC metadata schema and let me know
if you’d prefer any changes to naming, field placement, or compatibility
expectations? I’m happy to iterate quickly based on your guidance.
## References
- Refs: #7861 (Q2 2026 roadmap — AutoTP Universal Checkpoint support)
---------
Signed-off-by: nathon-lee <leejianwoo@gmail.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: nathon-lee <248585198+nathon-lee@users.noreply.github.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>