peft
Support `modules_to_save` config option when using DeepSpeed ZeRO-3 with ZeRO init enabled. #1450
Merged

pacman100 commented 1 year ago (edited)

What does this PR do?

  1. When using DeepSpeed ZeRO Stage-3 with ZeRO init enabled, the deepcopy performed in the ModulesToSaveWrapper class doesn't work as expected and creates a new module with 0 parameters. As a result, training fails with the following error:
  File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self.modules_to_save[self.active_adapter](*args, **kwargs)
assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 249, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 0, 'shape': (0,), 'ds_shape': (0,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {451}, 'ds_tensor.shape': torch.Size([0])}
  2. This PR resolves the issue: the parameters of the modules listed in the modules_to_save config option are gathered across processes with deepspeed.zero.GatheredParameters before the deepcopy (see the sketch after the launch command below).
  3. Example tested: https://github.com/huggingface/accelerate/blob/main/examples/nlp_example.py with the following changes:
...
+ from peft import get_peft_model, LoraConfig
...
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
+ config = LoraConfig(r=8, lora_alpha=16, task_type="SEQ_CLS")
+ model = get_peft_model(model, config)
+ print(model)
...

- config = {"lr": 2e-5, "num_epochs": 3, "seed": 42, "batch_size": 16}
+ config = {"lr": 2e-4, "num_epochs": 10, "seed": 42, "batch_size": 16}

Without this PR, the above error is raised; with this PR, the fine-tuning completes successfully:

epoch 7: {'accuracy': 0.8480392156862745, 'f1': 0.8945578231292517}
[2024-02-09 12:26:16,228] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
epoch 8: {'accuracy': 0.8504901960784313, 'f1': 0.8957264957264958}
/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/autograd/__init__.py:266: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
epoch 9: {'accuracy': 0.8602941176470589, 'f1': 0.902229845626072}

launch command:

accelerate launch --use_deepspeed --num_processes=2 --zero_stage=3 --zero3_init_flag=True --zero3_save_16bit_model=True --gradient_accumulation_steps=1 --gradient_clipping=1 --mixed_precision=fp16 nlp_example.py --mixed_precision fp16
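
For illustration, the gathering pattern the fix relies on looks roughly like the sketch below. This is not the actual peft code: clone_module_under_zero3 is a hypothetical helper, and only deepspeed.zero.GatheredParameters is taken from the description above.

```python
import copy

import torch.nn as nn


def clone_module_under_zero3(module: nn.Module) -> nn.Module:
    # Hypothetical helper, not the actual peft implementation.
    # Under ZeRO-3 with zero.Init, each rank holds only a shard of every
    # parameter (locally numel() == 0), so a plain deepcopy yields a module
    # with 0 parameters. Gathering the parameters first materializes the
    # full weights, and the copied module keeps them after the context exits.
    is_zero3_partitioned = any(
        p.numel() == 0 and hasattr(p, "ds_numel") for p in module.parameters()
    )
    if is_zero3_partitioned:
        import deepspeed

        with deepspeed.zero.GatheredParameters(list(module.parameters())):
            return copy.deepcopy(module)
    return copy.deepcopy(module)
```

The plain (non-DeepSpeed) path is untouched: gathering only kicks in when a parameter reports a ds_numel attribute with a local numel of 0.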

Fixes: huggingface/transformers#24445 (comment)

pacman100 Update other.py
13264a20
pacman100 Update other.py
0e1b5fab
HuggingFaceDocBuilderDev commented 1 year ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

pacman100 fix quality
bf65be02
BenjaminBossan approved these changes on 2024-02-09
BenjaminBossan commented 1 year ago

Thanks for making modules_to_save work with DeepSpeed Zero-3. LGTM.

pacman100 Update other.py
be8a0f11
pacman100 marked this pull request as ready for review 1 year ago
pacman100 merged a1c472f0 into main 1 year ago
pacman100 deleted the smangrul/fix-modules-to-save-for-ds-z3-init branch 1 year ago
