NVMe offload checkpoint (#4707)
The previous PR (#4416) had too many issues, so it was closed and reopened as this PR, which includes a passing test.
This PR proposes an implementation of model checkpointing when training with ZeRO-3 and NVMe offload:
1. Currently, the names of the NVMe offload files used in the checkpoint are based on the Python id of the parameter object, which is just the parameter's address in memory. This is not stable across runs, which has two disadvantages:
- The NVMe offload files grow with every run of the model even if the architecture did not change. This wastes disk space and, at least for me, was a surprise when I first saw it. It is not related to checkpointing.
- Without a way to match a file to its offloaded tensor, we cannot reload the checkpoint.
We propose an alternative naming scheme: parameters are named after their ds_id instead of their Python id, and the offloaded tensors are named after their state_name and the (new) parameter id. A sketch of this scheme follows the list.
2. A model checkpoint now has to include all the offloaded tensor files. During checkpoint save/load we copy the tensor files to/from the "offloaded_tensors" subdirectory of the checkpoint. Because these files can be large, especially as they accumulate in each checkpoint, we log the remaining space on the file system. We do not copy the gradient files. A sketch of this copy step also follows the list.
3. When loading a checkpoint, the optimizer has already prepared buffers for swapping. We need to purge them so that they are replaced with the freshly copied on-disk buffers from the checkpoint (see the last sketch below).
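
As an illustration, here is a minimal sketch of the proposed naming scheme. The helper names and the file suffix are hypothetical, not the actual DeepSpeed identifiers; the point is that the paths depend only on the stable ds_id and state_name, not on the Python id of the parameter object.

```python
import os

def param_swap_path(swap_dir: str, ds_id: int) -> str:
    # Parameter buffer file, keyed by the stable ds_id (assumed suffix).
    return os.path.join(swap_dir, f"param_{ds_id}.tensor.swp")

def state_swap_path(swap_dir: str, state_name: str, ds_id: int) -> str:
    # Offloaded state tensor file, keyed by its state_name and the owning
    # parameter's (new) id.
    return os.path.join(swap_dir, f"{state_name}_{ds_id}.tensor.swp")
```

Because these paths are identical across runs, repeated runs can reuse the same files and a checkpoint can map each file back to its tensor on load.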
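
The copy into the checkpoint's "offloaded_tensors" subdirectory, together with the free-space logging, could look roughly like the following. The function name, file pattern, and gradient-file filter are assumptions for illustration; the actual change wires this into DeepSpeed's checkpoint save/load paths.

```python
import glob
import logging
import os
import shutil

logger = logging.getLogger(__name__)

def copy_offloaded_tensors(swap_dir: str, checkpoint_dir: str) -> None:
    dest_dir = os.path.join(checkpoint_dir, "offloaded_tensors")
    os.makedirs(dest_dir, exist_ok=True)

    # Gather the offloaded tensor files, skipping gradient files
    # (assumed file pattern and gradient marker).
    files = [
        path for path in glob.glob(os.path.join(swap_dir, "*.tensor.swp"))
        if "gradient" not in os.path.basename(path)
    ]

    total_gib = sum(os.path.getsize(path) for path in files) / 2**30
    free_gib = shutil.disk_usage(checkpoint_dir).free / 2**30
    logger.info(
        f"Copying {len(files)} offloaded tensor files ({total_gib:.1f} GiB); "
        f"{free_gib:.1f} GiB remaining on the checkpoint file system"
    )

    for path in files:
        shutil.copy2(path, dest_dir)
```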
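
Finally, the purge on load can be pictured with this hypothetical stand-in for the optimizer's prepared swap buffers; the real change lives in DeepSpeed's swapping code, and this only illustrates the idea.

```python
class SwapBufferPool:
    """Hypothetical stand-in for the optimizer's prepared swap buffers."""

    def __init__(self):
        # ds_id -> buffer that the optimizer prepared before the load.
        self._prepared = {}

    def prepare(self, ds_id, buffer):
        self._prepared[ds_id] = buffer

    def purge(self):
        # Called when a checkpoint is loaded: discard the stale prepared
        # buffers so subsequent swap-ins read the freshly copied files from
        # the checkpoint's offloaded_tensors directory instead.
        self._prepared.clear()
```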
The key differences between this PR and the previous one:
- There's a test for a simple model with parameter/optimizer offload set
to cpu/cpu, cpu/nvme and nvme/nvme.
- Gradient files are not copied.
- FP16 and FP32 parameter buffers are handled correctly during load.
Fixes #2082.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>