transformers
exclude fsdp from delay_optimizer_creation
#34140
Merged


eljandoubi 220 days ago

What does this PR do?

This PR passes both the model and the optimizer to accelerate.prepare so that fp8 mixed precision can be enabled when it is configured.
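
As a rough illustration of the idea in the title, the sketch below models the decision with plain Python. It is not the Trainer source: `TrainerFlags`, `delay_optimizer_creation`, `prepare_plan`, and the `exclude_fsdp` switch are hypothetical names introduced here, and the returned step labels are only descriptive strings. The assumption is that when optimizer creation is delayed the model is prepared on its own, whereas excluding FSDP from that case lets the model and optimizer be prepared together.

```python
from dataclasses import dataclass


@dataclass
class TrainerFlags:
    """Hypothetical stand-in for the backend flags a trainer might expose."""
    sagemaker_mp: bool = False
    fsdp_xla: bool = False
    fsdp: bool = False


def delay_optimizer_creation(flags: TrainerFlags, exclude_fsdp: bool = True) -> bool:
    """Decide whether the optimizer must be created after the model is wrapped."""
    delay = flags.sagemaker_mp or flags.fsdp_xla
    if not exclude_fsdp:
        # Assumed earlier behaviour: plain FSDP also delayed optimizer creation.
        delay = delay or flags.fsdp
    return delay


def prepare_plan(flags: TrainerFlags, exclude_fsdp: bool = True) -> list:
    """Return the order of steps as descriptive labels (no real calls are made)."""
    if delay_optimizer_creation(flags, exclude_fsdp):
        # The model is wrapped alone, so the optimizer never goes through prepare.
        return ["prepare(model)", "create_optimizer"]
    # Model and optimizer are handed to prepare in a single call.
    return ["create_optimizer", "prepare(model, optimizer)"]


if __name__ == "__main__":
    print(prepare_plan(TrainerFlags(fsdp=True), exclude_fsdp=False))
    print(prepare_plan(TrainerFlags(fsdp=True), exclude_fsdp=True))
```

With `exclude_fsdp=True`, an FSDP run takes the second branch, which is the path where both objects reach the prepare step together.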

Fixes #34024


eljandoubi exclude fsdp from delay_optimizer_creation
cd0e8bb8
muellerzr commented on 2024-10-14 (220 days ago)

Nice :) Can we add a test in tests/test_trainer.py? We can set env variables to configure Accelerate properly (ACCELERATE_MIXED_PRECISION="fp8" will auto-use TE, i.e. Transformer Engine).
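
One way such a test could scope the environment variable is sketched below. It is only an assumption about how the test might be set up: `requested_mixed_precision` and `run_with_mixed_precision` are hypothetical helpers, and no Trainer or Accelerate object is built here. The only names taken from the comment above are the `ACCELERATE_MIXED_PRECISION` variable and the `"fp8"` value.

```python
import os
from unittest import mock


def requested_mixed_precision(default: str = "no") -> str:
    """Return the mixed-precision mode requested through the environment."""
    return os.environ.get("ACCELERATE_MIXED_PRECISION", default)


def run_with_mixed_precision(mode: str) -> str:
    """Run a (placeholder) test body with the variable set only for its duration."""
    # patch.dict restores os.environ on exit, so the setting cannot leak
    # into other tests in the same process.
    with mock.patch.dict(os.environ, {"ACCELERATE_MIXED_PRECISION": mode}):
        # A real test would construct and run the trainer here.
        return requested_mixed_precision()


if __name__ == "__main__":
    print(run_with_mixed_precision("fp8"))
```

Scoping the variable this way keeps the fp8 and non-fp8 cases independent when both run in one session.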

HuggingFaceDocBuilderDev 220 days ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

eljandoubi 220 days ago (edited 220 days ago) 👍 1

The required tests are distributed tests. We need to verify FSDP functionality with and without FP8 mixed precision. The appropriate test file might be tests/trainer/test_trainer_fsdp.py.

eljandoubi add test case for trainer: FSDP mode and fp8 as mixed precision
d18e6424
eljandoubi rearrange imports
3344110c
eljandoubi ruff formatted
5055e2a2
eljandoubi 220 days ago

Is the tests/trainer folder included in the CI tests? Where can I check the results for test_trainer_fsdp.py? @muellerzr @SunMarc

eljandoubi Merge branch 'huggingface:main' into fix_fsdp_with_fp8_in_trainer
656d7cc5
muellerzr 219 days ago (edited 219 days ago) 👍 1

@eljandoubi We can't run them on the normal CI, since GPU runners are not part of PRs.

Instead, when it's ready, I'll pull the PR down and run it myself.

eljandoubi 219 days ago ❤ 1

@muellerzr Thank you for the information. I have tested the branch in my code on a multi-node, multi-GPU setup using FSDP mode, both with and without FP8 mixed precision, and it worked as expected. Please let me know if you encounter any issues on your end.

eljandoubi Merge branch 'huggingface:main' into fix_fsdp_with_fp8_in_trainer
22cc58dd
eljandoubi Merge branch 'huggingface:main' into fix_fsdp_with_fp8_in_trainer
4827a392
eljandoubi adapt _init_fsdp to fp8
4a84f0f0
eljandoubi use _init_fsdp only when resume_from_checkpoint
2e91c5f4
eljandoubi Merge branch 'huggingface:main' into fix_fsdp_with_fp8_in_trainer
f5a3796a
eljandoubi In case of FDP, self.layer will be CheckpointWrapper which has no len…
af73835e
eljandoubi delete _init_fsdp
a2f30b0c
eljandoubi solve conflict
a838ba55
eljandoubi Merge branch 'huggingface:main' into fix_fsdp_with_fp8_in_trainer
cc5b4c3f
eljandoubi Merge branch 'huggingface:main' into fix_fsdp_with_fp8_in_trainer
d84336fc
eljandoubi Merge branch 'huggingface:main' into fix_fsdp_with_fp8_in_trainer
acffb63c
eljandoubi Merge branch 'huggingface:main' into fix_fsdp_with_fp8_in_trainer
78eed705
eljandoubi Merge branch 'huggingface:main' into fix_fsdp_with_fp8_in_trainer
49882f88
eljandoubi Merge branch 'huggingface:main' into fix_fsdp_with_fp8_in_trainer
9ac46640
eljandoubi Merge branch 'huggingface:main' into fix_fsdp_with_fp8_in_trainer
5acf8e05
eljandoubi Merge branch 'huggingface:main' into fix_fsdp_with_fp8_in_trainer
d7a01949
eljandoubi Merge branch 'huggingface:main' into fix_fsdp_with_fp8_in_trainer
58d18f67
eljandoubi fix conflict
b94376d4
eljandoubi Merge branch 'huggingface:main' into fix_fsdp_with_fp8_in_trainer
b9b9eb4f
eljandoubi 212 days ago

@muellerzr Any updates regarding this PR?

eljandoubi Merge branch 'huggingface:main' into fix_fsdp_with_fp8_in_trainer
f4665139
muellerzr approved these changes on 2024-10-23 (211 days ago)

Thanks! Can you do pip install -e .[quality] followed by make fixup? I'll then pull it locally to test on my 4090 system and we should be set!

muellerzr requested a review from ArthurZucker 211 days ago
muellerzr requested a review from SunMarc 211 days ago
eljandoubi Merge branch 'huggingface:main' into fix_fsdp_with_fp8_in_trainer
2948b297
eljandoubi Merge branch 'main' into fix_fsdp_with_fp8_in_trainer
748270da
eljandoubi make fixup
0ec8e587
eljandoubi Merge branch 'huggingface:main' into fix_fsdp_with_fp8_in_trainer
09df2edc
eljandoubi Merge branch 'main' into fix_fsdp_with_fp8_in_trainer
a3265d9c
eljandoubi Merge branch 'main' into fix_fsdp_with_fp8_in_trainer
33902fdf
eljandoubi 209 days ago (edited 209 days ago) ❤ 1

@muellerzr I have done make fixup.

eljandoubi Merge branch 'main' into fix_fsdp_with_fp8_in_trainer
cfd81524
eljandoubi Merge branch 'main' into fix_fsdp_with_fp8_in_trainer
571e58fd
eljandoubi Merge branch 'main' into fix_fsdp_with_fp8_in_trainer
02a63c7f
ArthurZucker approved these changes on 2024-10-28 (206 days ago)

Don't worry, we'll merge as is; the failing tests are unrelated!

ArthurZucker merged 8b3b9b48 into main 206 days ago
