DeepSpeed
28a196f7 - Fix ZenFlow ZeRO-3 selective optimizer crash with parameter offload on nvme (#8042)

Commit
5 days ago
Fix ZenFlow ZeRO-3 selective optimizer crash with parameter offload on nvme (#8042)

This PR fixes a crash in `ZenFlowSelectiveAdamW_stage3` when ZeRO-3 offloads parameters to NVMe or CPU.

- Detect offloaded partitions (a 0-dim NVMe placeholder, or a partition on a device other than the gradients') and update them through a per-parameter path: swap each NVMe partition in and out one at a time, run AdamW on the compute device, and write the result back to where the partition lives.
- Move `selected_indices` to the partition's device in `temp_copy_param`, and skip the resident pre-write in the offload bucket flush.
- Leave the existing batched path unchanged for GPU-resident partitions.
- Add unit tests covering the swap-in/update/swap-out path.

## Root Cause

The selective optimizer updates each bf16 partition in place through `param.ds_tensor.data`, assuming it is resident on the compute device. When a partition is offloaded to NVMe, `ds_tensor.data` is a 0-dim placeholder, so `narrow()` raises "narrow() cannot be applied to a 0-dim tensor"; when it is on CPU it lives on a different device than the selected gradients, so indexing raises a device-mismatch error.

Fixes #7686

Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
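The dispatch between the resident path and the per-parameter offload path can be illustrated with a small toy model. This is a minimal sketch in plain Python, not the DeepSpeed implementation: `Partition`, `ToySwapper`, `is_offloaded`, `adamw_selected`, and `update_partition` are all hypothetical names, Python lists stand in for tensors, `data=None` stands in for the 0-dim NVMe placeholder, and the hyperparameter defaults are arbitrary.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Partition:
    """Toy stand-in for a ZeRO-3 parameter partition (hypothetical)."""
    name: str
    device: str                  # "gpu", "cpu", or "nvme"
    data: Optional[List[float]]  # None models the 0-dim NVMe placeholder


class ToySwapper:
    """Hypothetical NVMe store: a dict keyed by partition name."""

    def __init__(self) -> None:
        self.store: Dict[str, List[float]] = {}

    def swap_out(self, part: Partition) -> None:
        self.store[part.name] = part.data
        part.data = None  # leave only the placeholder behind

    def swap_in(self, part: Partition) -> None:
        part.data = self.store.pop(part.name)


def make_state(num_selected: int) -> dict:
    """Per-partition AdamW state for the selected entries only."""
    return {"step": 0, "m": [0.0] * num_selected, "v": [0.0] * num_selected}


def is_offloaded(part: Partition, grad_device: str) -> bool:
    """Offloaded means: a placeholder, or a device other than the gradients'."""
    return part.data is None or part.device != grad_device


def adamw_selected(values, grads, indices, state,
                   lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
    """Apply one AdamW step to values[i] for i in indices; return a new list."""
    state["step"] += 1
    t, (b1, b2) = state["step"], betas
    out = list(values)
    for k, i in enumerate(indices):
        g = grads[k]
        state["m"][k] = b1 * state["m"][k] + (1 - b1) * g
        state["v"][k] = b2 * state["v"][k] + (1 - b2) * g * g
        m_hat = state["m"][k] / (1 - b1 ** t)
        v_hat = state["v"][k] / (1 - b2 ** t)
        decayed = out[i] * (1 - lr * weight_decay)  # decoupled weight decay
        out[i] = decayed - lr * m_hat / (math.sqrt(v_hat) + eps)
    return out


def update_partition(part, grads, indices, state, swapper, grad_device="gpu"):
    """Update one partition and report which path was taken."""
    if not is_offloaded(part, grad_device):
        # Stand-in for the existing (batched) path on resident partitions.
        part.data = adamw_selected(part.data, grads, indices, state)
        return "resident"
    on_nvme = part.data is None
    if on_nvme:
        swapper.swap_in(part)   # one partition at a time
    part.data = adamw_selected(part.data, grads, indices, state)
    if on_nvme:
        swapper.swap_out(part)  # write back to where the partition lives
    return "offloaded"
```

In this model, a partition that was swapped out before the step ends the step swapped out again, with only the selected entries changed; a CPU partition takes the same per-parameter path without the swap, and a GPU partition never touches the swapper.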