DeepSpeed
d1e62ff2 - Add DataStates-LLM: Asynchronous Checkpointing Engine Support (#7166)

Commit

120 days ago

Add DataStates-LLM: Asynchronous Checkpointing Engine Support (#7166)

We are a team at Argonne National Laboratory working on low-overhead asynchronous checkpointing approaches for LLMs and transformers. As part of these efforts, we have developed DataStates-LLM, a library that we would like to contribute to the DeepSpeed community: https://github.com/datastates/datastates-llm

The key idea is to copy tensors from the GPU to the host without blocking while the forward and backward passes run. We block only if these copies have not finished by the start of the update phase. Meanwhile, the tensors are flushed asynchronously from host memory to durable storage (parallel file systems, local SSDs, etc.).

To enable this capability, our initial implementation makes the scheduler aware of checkpointing by calling a ckpt.wait() primitive before starting the update phase. We illustrate this with the pipeline scheduler. We are also considering a scheduler-independent solution that integrates with DeepSpeed/Megatron and provides a hook for the start of the update phase, which we can use to run ckpt.wait().

We appreciate your feedback and look forward to collaborating in this space.

---------

Signed-off-by: amaurya <amaurya@anl.gov>
Co-authored-by: amaurya <amaurya@anl.gov>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
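The ordering described in the commit message (start a non-blocking copy, overlap it with forward and backward, wait only before the update, flush to storage in the background) can be sketched in plain Python. This is an illustration under stated assumptions, not the DataStates-LLM or DeepSpeed API: `AsyncCheckpointer`, `save`, `flush`, and `train_step_demo` are hypothetical names, a background thread stands in for the GPU-to-host copy, lists stand in for tensors, and only the `wait()` name is taken from the message.

```python
import os
import pickle
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical sketch of the wait-before-update pattern (not the real engine)."""

    def __init__(self):
        self._copied = threading.Event()
        self._copied.set()  # nothing is in flight initially
        self._flusher = None

    def save(self, state, path):
        """Start a background snapshot and flush, then return immediately."""
        self.flush()  # finish any earlier checkpoint first
        self._copied.clear()
        self._flusher = threading.Thread(target=self._run, args=(state, path))
        self._flusher.start()

    def _run(self, state, path):
        try:
            # Stand-in for the GPU-to-host copy: snapshot the current values.
            host_copy = {name: list(values) for name, values in state.items()}
        finally:
            self._copied.set()  # never leave wait() blocked forever
        # Flush from "host memory" to durable storage; updates may proceed now.
        with open(path, "wb") as f:
            pickle.dump(host_copy, f)

    def wait(self):
        """Block until the snapshot is complete; call before the update phase."""
        self._copied.wait()

    def flush(self):
        """Block until the checkpoint file has been fully written."""
        if self._flusher is not None:
            self._flusher.join()
            self._flusher = None


def train_step_demo():
    state = {"weights": [1.0, 2.0, 3.0]}
    ckpt = AsyncCheckpointer()
    path = os.path.join(tempfile.mkdtemp(), "step0.pkl")

    ckpt.save(state, path)  # returns without waiting for the copy
    # Forward and backward passes would run here; they only read the weights.
    ckpt.wait()  # the snapshot must be complete before weights change
    for i in range(len(state["weights"])):
        state["weights"][i] -= 0.5  # update phase mutates weights in place

    ckpt.flush()
    with open(path, "rb") as f:
        saved = pickle.load(f)
    return saved, state
```

In this sketch, `wait()` guards only the snapshot, so the in-place update can overlap with the write to disk; the saved file still holds the pre-update values.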

References

#7166 - Add DataStates-LLM: Asynchronous Checkpointing Engine Support

Author

mauryaavinash95

Parents

64c0052f
