DeepSpeed
3bdebc03 - Fix/fix autotp universal checkpoint ci (#7937)

Fix/fix autotp universal checkpoint ci (#7937)

The full CI test [fails](https://github.com/deepspeedai/DeepSpeed/actions/runs/23735417401/job/69138729446) with "RuntimeError: Cannot re-initialize CUDA" in the universal checkpoint and AutoTP tests, because they call `torch.cuda.current_device()` under `pytest --forked`. Since these tests only touch universal checkpoint metadata, that call is unnecessary. This PR skips constructor-time AutoTP materialization when `mp_group` is `None`. Partitioning still happens in real AutoTP usage, where an actual model-parallel group is given.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
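The guard described above can be sketched roughly as follows. This is a hypothetical illustration, not DeepSpeed's actual code: the function name `materialize_autotp` and the tuple-shaped `mp_group` are invented for the example; the point is only that the `mp_group is None` check returns early before any device-touching partitioning work runs.

```python
def materialize_autotp(model_state, mp_group=None):
    """Hypothetical sketch of the constructor-time guard: partition
    parameters only when an actual model-parallel group is given."""
    if mp_group is None:
        # Metadata-only path (e.g., universal checkpoint tests):
        # skip partitioning, so nothing like torch.cuda.current_device()
        # is ever reached under `pytest --forked`.
        return model_state
    rank, world_size = mp_group  # assumed (rank, world_size) pair for the sketch
    # Toy strided partition of each weight across the group.
    return {name: w[rank::world_size] for name, w in model_state.items()}

state = {"w": list(range(8))}
# No group: the state is returned untouched.
skipped = materialize_autotp(state)
# Real group: rank 0 of 2 gets every other element.
part = materialize_autotp(state, mp_group=(0, 2))
```

With `mp_group=None` the input is passed through unchanged, which mirrors why the metadata-only CI tests no longer trigger CUDA initialization.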