DeepSpeed
d1e62ff2 - Add DataStates-LLM: Asynchronous Checkpointing Engine Support (#7166)

Commit
69 days ago
We are a team at Argonne National Laboratory working on low-overhead asynchronous checkpointing approaches for LLMs and transformers. As part of these efforts, we have developed DataStates-LLM, a library that we would like to contribute to the DeepSpeed community: https://github.com/datastates/datastates-llm

The key idea is to copy tensors from the GPU to the host without blocking during the forward and backward passes. We block only if these copies have not finished by the start of the update phase. Meanwhile, the tensors are flushed asynchronously from host memory to durable storage (parallel file systems, local SSDs, etc.).

To enable this capability, our initial implementation makes the scheduler aware of checkpointing by calling a ckpt.wait() primitive before starting the update phase. We illustrated this with the pipeline scheduler. We are also considering a scheduler-independent solution that integrates with DeepSpeed/Megatron and provides a hook for the start of the update phase, which we can leverage to run ckpt.wait().

We appreciate your feedback and look forward to a collaboration in this space.

---------

Signed-off-by: amaurya <amaurya@anl.gov>
Co-authored-by: amaurya <amaurya@anl.gov>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
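
The control flow described in this message can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the DataStates-LLM or DeepSpeed API: the AsyncCheckpointer class, its save_async(), drain() and close() methods, and the train_step() function are hypothetical names, Python lists stand in for GPU tensors, and a dictionary stands in for durable storage. Only the wait() call before the update phase mirrors the ckpt.wait() primitive named above.

```python
from concurrent.futures import ThreadPoolExecutor


class AsyncCheckpointer:
    """Hypothetical two-stage checkpointer: copy to host, then flush to storage."""

    def __init__(self):
        self._copy_pool = ThreadPoolExecutor(max_workers=2)
        self._flush_pool = ThreadPoolExecutor(max_workers=2)
        self._copies = []   # pending device-to-host copies
        self._flushes = []  # pending host-to-storage flushes
        self.storage = {}   # stand-in for a parallel file system or local SSD

    def _flush(self, name, snapshot):
        # Stage 2: persist the host snapshot (here: store it in a dict).
        self.storage[name] = snapshot

    def _copy(self, name, tensor):
        # Stage 1: snapshot the "device" tensor into host memory,
        # then hand the snapshot to the flush pool.
        snapshot = list(tensor)
        self._flushes.append(self._flush_pool.submit(self._flush, name, snapshot))

    def save_async(self, name, tensor):
        # Returns immediately; the copy proceeds in the background.
        self._copies.append(self._copy_pool.submit(self._copy, name, tensor))

    def wait(self):
        # Block only until the host copies are done; flushes may still be running.
        for future in self._copies:
            future.result()
        self._copies.clear()

    def drain(self):
        # Block until every snapshot has reached storage.
        self.wait()
        for future in self._flushes:
            future.result()
        self._flushes.clear()

    def close(self):
        self.drain()
        self._copy_pool.shutdown()
        self._flush_pool.shutdown()


def train_step(params, ckpt, step, lr=0.1):
    # Start non-blocking snapshots of the current parameters.
    for name, tensor in params.items():
        ckpt.save_async(f"step{step}/{name}", tensor)

    # Forward and backward passes would run here, overlapping with the copies.
    grads = {name: [1.0] * len(tensor) for name, tensor in params.items()}

    # The update mutates parameters in place, so the copies must finish first.
    ckpt.wait()
    for name, tensor in params.items():
        for i in range(len(tensor)):
            tensor[i] -= lr * grads[name][i]
```

In this sketch, wait() guards only the host copies, since those are the only operations that read the parameters about to be overwritten. The flush to storage keeps running during the update and is joined later by drain().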