Fix ZenFlow ZeRO-3 selective optimizer crash with parameter offload on NVMe (#8042)

This PR fixes a crash in `ZenFlowSelectiveAdamW_stage3` when ZeRO-3
offloads parameters to NVMe or CPU.

- Detect offloaded partitions (a 0-dim NVMe placeholder, or a partition
on a device other than the gradients') and update them through a
per-parameter path: swap each NVMe partition in and out one at a time,
run AdamW on the compute device, and write the result back to where the
partition lives.
- Move `selected_indices` to the partition's device in
`temp_copy_param`, and skip the resident pre-write in the offload bucket
flush.
- Leave the existing batched path unchanged for GPU-resident partitions.
- Add unit tests covering the swap-in/update/swap-out path.
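
The detection rule and the gather/update/write-back flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `is_offloaded` and `update_selected` are hypothetical names, the partition is treated as a flat 1-D tensor, a plain gradient step stands in for the AdamW update, and the NVMe swap-in/swap-out step is not modeled.

```python
import torch


def is_offloaded(partition: torch.Tensor, grad: torch.Tensor) -> bool:
    """Hypothetical check: 0-dim placeholder, or partition on another device than grad."""
    return partition.dim() == 0 or partition.device != grad.device


def update_selected(partition: torch.Tensor,
                    grad: torch.Tensor,
                    selected_indices: torch.Tensor,
                    lr: float = 0.1) -> None:
    """Hypothetical per-parameter path for a partition that is not on the compute device.

    Assumes `partition` is a flat 1-D tensor that is already materialized
    (an NVMe partition would have to be swapped in first) and that `grad`
    holds one gradient value per selected index.
    """
    # Index on the device where the partition lives.
    idx = selected_indices.to(partition.device)
    # Bring only the selected elements to the compute device, in fp32.
    work = partition[idx].to(grad.device, dtype=torch.float32)
    # Stand-in for the AdamW step: a plain gradient step (assumption).
    work.add_(grad.float(), alpha=-lr)
    # Write the result back to where the partition lives.
    partition[idx] = work.to(partition.device, dtype=partition.dtype)


partition = torch.ones(6, dtype=torch.bfloat16)   # resident stand-in partition
grad = torch.ones(2)                              # gradients for the selected elements
selected = torch.tensor([1, 4])
update_selected(partition, grad, selected)
```

The predicate mirrors the two cases named in the first bullet; when it returns `False`, the existing batched path would be used unchanged.
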

## Root Cause

The selective optimizer updates each bf16 partition in place through
`param.ds_tensor.data`, assuming it is resident on the compute device.
When a partition is offloaded to NVMe, `ds_tensor.data` is a 0-dim
placeholder, so `narrow()` raises "narrow() cannot be applied to a 0-dim
tensor"; when it is on CPU it lives on a different device than the
selected gradients, so indexing raises a device-mismatch error.
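
The two failure modes can be reproduced in isolation with plain PyTorch tensors. The snippet below is a minimal sketch that uses stand-in tensors rather than real `ds_tensor` partitions; the device-mismatch part only runs when a CUDA device is available.

```python
import torch

# NVMe case: a 0-dim tensor stands in for the offloaded placeholder.
placeholder = torch.empty((), dtype=torch.bfloat16)
narrow_error = None
try:
    placeholder.narrow(0, 0, 1)
except (RuntimeError, IndexError) as err:
    narrow_error = str(err)

# CPU case: the partition is on CPU while the selected indices are on the GPU.
index_error = None
if torch.cuda.is_available():
    cpu_partition = torch.ones(8, dtype=torch.bfloat16)
    gpu_indices = torch.tensor([0, 3], device="cuda")
    try:
        cpu_partition[gpu_indices]
    except RuntimeError as err:
        index_error = str(err)
    # Moving the indices to the partition's device avoids the mismatch.
    selected = cpu_partition[gpu_indices.to(cpu_partition.device)]
```
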

Fixes #7686

Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>