Looks good so far, let me know if you need help testing / debugging
It works but needs hyperparameter tuning. If you could take it for a test run, that would be great!
Thanks a lot for working on this. Just left some comments. Most importantly:
Let's make sure we can load the saved lora with the pipeline for inference. I had some issues with it when I tried.
Everything else looks good!
# 8. Add LoRA to the student U-Net, only the LoRA projection matrix will be updated by the optimizer.
lora_config = LoraConfig(
    r=args.lora_rank,
    target_modules=[
        "to_q",
        "to_k",
        "to_v",
        "to_out.0",
        "proj_in",
        "proj_out",
        "ff.net.0.proj",
        "ff.net.2",
        "conv1",
        "conv2",
        "conv_shortcut",
        "downsamplers.0.conv",
        "upsamplers.0.conv",
        "time_emb_proj",
for later:
We could also think of making this an argument.
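A rough sketch of what that could look like (the --lora_target_modules flag, its default value, and the parser shown here are hypothetical, not part of the script):

import argparse

from peft import LoraConfig

parser = argparse.ArgumentParser()
parser.add_argument("--lora_rank", type=int, default=64)
# Hypothetical flag: let users override which modules get LoRA layers.
parser.add_argument(
    "--lora_target_modules",
    type=str,
    default=(
        "to_q,to_k,to_v,to_out.0,proj_in,proj_out,ff.net.0.proj,ff.net.2,"
        "conv1,conv2,conv_shortcut,downsamplers.0.conv,upsamplers.0.conv,time_emb_proj"
    ),
    help="Comma-separated list of module names to wrap with LoRA.",
)
args = parser.parse_args()

lora_config = LoraConfig(r=args.lora_rank, target_modules=args.lora_target_modules.split(","))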
Let's make sure we can load the saved lora with the pipeline for inference. I had some issues with it when I tried.
Could you post the snippet with which you tried and the error you got?
When I try to load the LoRA saved with the script using
pipe.load_lora_weights(path_to_saved_lora, weight_name="pytorch_lora_weights.safetensors")
I get
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In [13], line 1
----> 1 pipe.load_lora_weights("valhalla/lcm-sdxl-distill-lora-k5", weight_name="pytorch_lora_weights.safetensors")
File ~/diffusers/src/diffusers/loaders.py:3233, in StableDiffusionXLLoraLoaderMixin.load_lora_weights(self, pretrained_model_name_or_path_or_dict, adapter_name, **kwargs)
3230 if not is_correct_format:
3231 raise ValueError("Invalid LoRA checkpoint.")
-> 3233 self.load_lora_into_unet(
3234 state_dict, network_alphas=network_alphas, unet=self.unet, adapter_name=adapter_name, _pipeline=self
3235 )
3236 text_encoder_state_dict = {k: v for k, v in state_dict.items() if "text_encoder." in k}
3237 if len(text_encoder_state_dict) > 0:
File ~/diffusers/src/diffusers/loaders.py:1630, in LoraLoaderMixin.load_lora_into_unet(cls, state_dict, network_alphas, unet, low_cpu_mem_usage, adapter_name, _pipeline)
1627 if "lora_B" in key:
1628 rank[key] = val.shape[1]
-> 1630 lora_config_kwargs = get_peft_kwargs(rank, network_alphas, state_dict, is_unet=True)
1631 lora_config = LoraConfig(**lora_config_kwargs)
1633 # adapter_name
File ~/diffusers/src/diffusers/utils/peft_utils.py:122, in get_peft_kwargs(rank_dict, network_alpha_dict, peft_state_dict, is_unet)
120 rank_pattern = {}
121 alpha_pattern = {}
--> 122 r = lora_alpha = list(rank_dict.values())[0]
124 if len(set(rank_dict.values())) > 1:
125 # get the rank occuring the most number of times
126 r = collections.Counter(rank_dict.values()).most_common()[0][0]
@patil-suraj here's a summary of the changes: the teacher prediction is now computed under the disable_adapter() context (cc: @younesbelkada).
The following should work now:
from diffusers import StableDiffusionXLPipeline
import torch
pipeline_id = "stabilityai/stable-diffusion-xl-base-1.0"
ckpt_id = "sayakpaul/lora-lcm-sdxl-new"
pipeline = StableDiffusionXLPipeline.from_pretrained(pipeline_id, torch_dtype=torch.float16).to("cuda")
pipeline.load_lora_weights(ckpt_id, use_auth_token=True)
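After loading, inference would look roughly like this (just a sketch: swapping in LCMScheduler with ~4 steps and guidance_scale=1.0 is the usual LCM-LoRA recipe, not something verified in this PR; the prompt and output filename are placeholders):
from diffusers import LCMScheduler

# Swap the scheduler for the LCM one and sample with very few steps.
pipeline.scheduler = LCMScheduler.from_config(pipeline.scheduler.config)
image = pipeline(
    "a close-up picture of an old man standing in the rain",
    num_inference_steps=4,
    guidance_scale=1.0,
).images[0]
image.save("lcm_lora_sample.png")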
We're now serializing the final checkpoint in the diffusers-native format.
I'm currently running an experiment with the following command:
CUDA_VISIBLE_DEVICES=1 accelerate launch train_lcm_distill_lora_sdxl.py \
--pretrained_teacher_model=${MODEL_NAME} \
--pretrained_vae_model_name_or_path=${VAE_PATH} \
--output_dir="lora-lcm-sdxl-new" \
--mixed_precision="fp16" \
--dataset_name=$DATASET_NAME \
--resolution=1024 \
--train_batch_size=16 \
--gradient_accumulation_steps=1 \
--gradient_checkpointing \
--use_8bit_adam \
--learning_rate=1e-6 --loss_type="huber" \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=5000 \
--checkpointing_steps=100 \
--validation_steps=50 \
--seed="0" \
--report_to="wandb" \
--push_to_hub
It'd be very nice to have your 👀 on this.
cc @patil-suraj
Looking very good, just left some suggestions. We should remove the empty prompt embeds options here; since we train the model with CFG, it's not necessary. Apart from this, it's in a pretty good state already.
if env_local_rank != -1 and env_local_rank != args.local_rank:
    args.local_rank = env_local_rank

if args.proportion_empty_prompts < 0 or args.proportion_empty_prompts > 1:
    raise ValueError("`--proportion_empty_prompts` must be in the range [0, 1].")
This should be removed.
# create custom saving & loading hooks so that `accelerator.save_state(...)` serializes in a nice format
def save_model_hook(models, weights, output_dir):
    if accelerator.is_main_process:
        unet_ = accelerator.unwrap_model(unet)
        # save weights in peft format to be able to load them back
        unet_.save_pretrained(output_dir)
We should save the intermediate checkpoints in a loadable format as well, since intermediate checkpoints are sometimes better than the final one. That way it's convenient to just load a checkpoint and run inference with it, without having to manually convert the peft state dict.
WDYT about saving both in the peft format (as is currently being done for the intermediate checkpoints) and also in the diffusers format?
This way, the peft checkpoints can be loaded via load_adapter(), and the diffusers-format ones via .load_lora_weights().

Sounds good!
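A minimal sketch of what that dual-format hook could look like (assuming get_peft_model_state_dict from peft and convert_state_dict_to_diffusers from diffusers.utils are available, and reusing accelerator/unet from the script; treat the exact calls as illustrative):
from diffusers import StableDiffusionXLPipeline
from diffusers.utils import convert_state_dict_to_diffusers
from peft import get_peft_model_state_dict

def save_model_hook(models, weights, output_dir):
    if accelerator.is_main_process:
        unet_ = accelerator.unwrap_model(unet)
        # peft format: can be loaded back for resuming training via load_adapter()
        unet_.save_pretrained(output_dir)
        # diffusers format: directly loadable for inference via pipe.load_lora_weights()
        lora_state_dict = convert_state_dict_to_diffusers(get_peft_model_state_dict(unet_))
        StableDiffusionXLPipeline.save_lora_weights(output_dir, unet_lora_layers=lora_state_dict)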
# Get teacher model prediction on noisy_latents and conditional embedding
# Notice that we're disabling the adapter layers within the `unet` and then it becomes a
# regular teacher. This way, we don't have to separately initialize a teacher UNet.
with torch.no_grad() and torch.autocast(
    str(accelerator.device), dtype=weight_dtype
) and unet.disable_adapter():
Since we are just using one unet now, this autocast might not be necessary. The unet is already prepared with accelerate, which wraps the model in autocast when doing mixed-precision training.
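i.e. the teacher pass could be run without the explicit autocast, something like this (just a sketch; noisy_latents, timesteps, prompt_embeds, and added_conditions are stand-ins for the script's variables):
# With the unet already prepared by accelerate, the teacher prediction could simply be:
with torch.no_grad(), unet.disable_adapter():
    cond_teacher_output = unet(
        noisy_latents,
        timesteps,
        encoder_hidden_states=prompt_embeds,
        added_cond_kwargs=added_conditions,
    ).sample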
Kept it because the unet that's being trained is also under autocast (taken from the original script).
Curious to know how are the results with smaller datasets :)
@DN6 we can't add pip install peft in the PR test workflow either, as that swaps the training backend to use peft, which is not fully supported for the other training examples.
I see the following as a potential workaround. In the workflow, we run all the example tests except for test_text_to_image_lcm_lora_sdxl without installing peft. Then we add a command to install peft and then just test test_text_to_image_lcm_lora_sdxl. Does this work for you?
Later, when the examples are fully equipped with peft, we can revisit it. LMK.
Update: With #5388, this might already be fixed (cc: @younesbelkada).
# regular teacher. This way, we don't have to separately initialize a teacher UNet.
using_cuda = "cuda" in str(accelerator.device)
with torch.no_grad() and torch.autocast(
    str(accelerator.device), dtype=weight_dtype if using_cuda else torch.bfloat16, enabled=using_cuda
@patil-suraj I am not super happy about how I am doing it here but it's the only way I found to make the test pass on a CPU. Let me know if you have better ideas.
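One alternative worth considering (only a sketch, not tested here; accelerator, weight_dtype, and unet as in the script): build the autocast context conditionally and fall back to a no-op on CPU, so the with-statement itself stays unchanged.
import contextlib

import torch

# Enter autocast only on CUDA; on CPU use a no-op context so the test can still run.
autocast_ctx = (
    torch.autocast(str(accelerator.device), dtype=weight_dtype)
    if "cuda" in str(accelerator.device)
    else contextlib.nullcontext()
)
with torch.no_grad(), autocast_ctx, unet.disable_adapter():
    ...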
@sayakpaul I can confirm all issues with respect to the failing example CI are fixed with #5388. Perhaps we can merge #5388 first, and I can work on saving the adapter config in a follow-up PR. What do you think?
I will have a look at #5388.
Hi @sayakpaul, thanks for your great efforts!
I left a few comments regarding the current state of the fine-tuning script. Right now the script deviates a bit from the PEFT integration in diffusers, as it utilises the PeftModel interface. Although it works fine, it might lead to some confusion (why load_adapter for inference, which modifies the model in place, and get_peft_model, which suddenly returns a PeftModel, for training?). IMO, when using the PEFT integration (in diffusers and/or transformers) we should see peft as a utility library rather than a modeling library.
I propose to slightly refactor the training logic, which should also work as expected, following #5388. If that's something we want to ship asap, that's totally fine and it can be re-worked in a follow-up PR; let me know if you need any help!
@younesbelkada I resolved all your comments. Could you take a look again? Thanks so much in advance! In particular, how I am disabling and enabling the adapters.
Tests will pass after #5388 is merged.
Also, @patil-suraj, I launched an experiment yesterday:
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="lora-lora-sdxl-new"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"
export VAE_PATH="madebyollin/sdxl-vae-fp16-fix"
CUDA_VISIBLE_DEVICES=1 accelerate launch train_lcm_distill_lora_sdxl.py \
--pretrained_teacher_model=${MODEL_NAME} \
--pretrained_vae_model_name_or_path=${VAE_PATH} \
--output_dir="lora-lcm-sdxl-new" \
--mixed_precision="fp16" \
--dataset_name=$DATASET_NAME \
--resolution=1024 \
--train_batch_size=16 \
--gradient_accumulation_steps=1 \
--gradient_checkpointing \
--use_8bit_adam \
--learning_rate=1e-5 --loss_type="huber" \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=5000 \
--checkpointing_steps=100 \
--validation_steps=50 \
--seed="0" \
--report_to="wandb" \
--push_to_hub
The results (https://wandb.ai/sayakpaul/text2image-fine-tune/runs/ahqu0wvp) are random noise. Anything crucial I am missing? Any usual suspects?
A couple of things I saw while testing:
I ran out of GPU memory while running the second evaluation. I verified my max batch size by running the script using --validation_steps=5 to verify that I could train, evaluate, and go back to training successfully. But when I launched the training run for real with evaluation and checkpointing every 100 steps, it crashed at the 200-step mark. I'm not sure what could be the reason for this. I decreased my batch size (slightly, 12 to 10) after that, and the same thing happened again.
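(For reference, a common mitigation in such training scripts is to explicitly free the temporary validation pipeline before resuming training; just a sketch, not a confirmed fix for the crash above, and it assumes the validation code builds a local pipeline object:)
import gc

import torch

# After the validation loop, drop the pipeline and release cached CUDA memory.
del pipeline
gc.collect()
torch.cuda.empty_cache()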
Didn't see this with the following (I am using a single GPU from the DGX):
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="lora-lora-sdxl-new"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"
export VAE_PATH="madebyollin/sdxl-vae-fp16-fix"
CUDA_VISIBLE_DEVICES=1 accelerate launch train_lcm_distill_lora_sdxl.py \
--pretrained_teacher_model=${MODEL_NAME} \
--pretrained_vae_model_name_or_path=${VAE_PATH} \
--output_dir="lora-lcm-sdxl-new" \
--mixed_precision="fp16" \
--dataset_name=$DATASET_NAME \
--resolution=1024 \
--train_batch_size=16 \
--gradient_accumulation_steps=1 \
--gradient_checkpointing \
--use_8bit_adam \
--lora_rank=16 \
--learning_rate=1e-5 --loss_type="huber" --adam_weight_decay=0.0 \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=5000 \
--checkpointing_steps=100 \
--validation_steps=50 \
--seed="0" \
--report_to="wandb" \
--push_to_hub
There are many images smaller than 1024 pixels in the pokemon-blip-captions dataset (didn't count how many). Would that be a problem for SDXL training?
Could be, but SDXL doesn't really excel at resolutions lower than that, from what I have seen. Currently, I don't have a good workaround other than suggesting the use of an upscaler to upsample the images to 1024x1024.
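(For reference, a quick way to count how many images fall below 1024px; a standalone sketch that assumes the dataset's image column is named "image":)
from datasets import load_dataset

ds = load_dataset("lambdalabs/pokemon-blip-captions", split="train")
# Count images whose shorter side is below the 1024px training resolution.
small = sum(1 for example in ds if min(example["image"].size) < 1024)
print(f"{small} / {len(ds)} images are smaller than 1024px on their shorter side")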
Impressive work @sayakpaul! All good on the PEFT side!
What about other files with the same problem?
More info:
huan085128/lcm_lora#1
https://github.com/huggingface/diffusers/blob/main/examples/consistency_distillation/train_lcm_distill_lora_sdxl_wds.py#L506C15-L506C15
@ArturFormella this PR should provide a good reference for anyone willing to adjust the scripts accordingly. Plus @pcuenca is working on #5908.
@BenjaminBossan pinging you since Sourab is busy and Younes is OOO.
In this PR, I am doing what I described here: #5778 (comment). The training, however, seems to be completely off, as it's yielding just noise: https://wandb.ai/sayakpaul/text2image-fine-tune/runs/ax6w9q0x?workspace=user-sayakpaul. I don't think it's just a matter of hyperparameters.
I suspect some peft-related changes that I did in this PR might have caused it. I am unable to see any potential suspects in comparison to what we have in https://github.com/huggingface/diffusers/blob/main/examples/consistency_distillation/train_lcm_distill_lora_sdxl_wds.py (don't worry about the dataloading code). Would you be able to take a look and advise regarding the peft-related changes?
Hi @sayakpaul I took a look at the script and also compared it to the other one you linked. I have no experience in training diffusion models, so I cannot judge if the generated output could be solved by some better hyper-parameters or if something else is completely off. The diff between the two scripts is quite large, so I could not pinpoint what difference causes the trouble.
The main difference that involves PEFT seems to be the way the teacher network is being used via disable_adapters(). IIUC, the teacher net is completely frozen, whereas the student net is teacher + LoRA parameters on top, with only the latter being updated. Maybe something worth checking would be whether requires_grad is set correctly on all params after disable_adapters() is called, and whether it is re-set correctly after enable_adapters() is called. The diffusers implementation of these methods looks correct to me, but if the training issues are PEFT-related, this is what I would check first.
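A quick way to run that check (a sketch against the script's unet; it just counts trainable parameters before and after toggling the adapters):
def count_trainable(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print("before:", count_trainable(unet))
unet.disable_adapters()
print("after disable_adapters:", count_trainable(unet))
unet.enable_adapters()
print("after enable_adapters:", count_trainable(unet))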
Thanks @BenjaminBossan for your inputs. I will get to them and let you know.
@BenjaminBossan sorry for the delay here. Big log file here: https://huggingface.co/datasets/sayakpaul/sample-datasets/blob/main/logs.txt.
Some findings:
params_to_optimize is defined here: https://github.com/huggingface/diffusers/blob/8fecdda23cefc30053a511db597b9c017b8fcaa1/examples/consistency_distillation/train_lcm_distill_lora_sdxl.py#L864C1-L865C64. Somewhat surprisingly, the length of params_to_optimize (when packed as a list) is zero. This is not expected, no? However, params_to_optimize_named prints all the trainable params correctly.
print(len(list(params_to_optimize)), len(list(params_to_optimize_after_enable))) gives 0 1576, respectively. I think 1576 is expected and should have been the case for params_to_optimize as well.

Somewhat surprisingly, the length of params_to_optimize (when packed as a list) is zero. This is not expected, no?
The issue here is that the result of filter is lazy, like a generator expression. Since you pass the same expression to the optimizer right after, it is exhausted, so when you call list on it after that, it is empty. Almost certainly, it would contain the same number if you cast it to a list before passing it to the optimizer.
The other numbers are, as you say, pretty much what's expected, so my suspicion that the issue could be related to requires_grad now seems very unlikely.
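A tiny standalone illustration of that pitfall:
params = [1, 2, 3]
trainable = filter(lambda p: p > 1, params)

first_pass = list(trainable)   # [2, 3]: this consumes the filter object
second_pass = list(trainable)  # []: already exhausted, just like after handing it to the optimizer
print(first_pass, second_pass)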
Instead of initializing two UNets (teacher and student), it leverages the disable_adapters() and enable_adapters() functions. In this case, the teacher is without any adapters and the student is with LoRA. We only update the LoRA params in the student.
Do you have a reference for using this trick for training? I wonder if there could be some kind of flaw in that approach which could explain the findings.
Do you have a reference for using this trick for training? I wonder if there could be some kind of flaw in that approach which could explain the findings.
This is taken from https://huggingface.co/blog/stackllama. It's implemented in TRL too:
https://github.com/huggingface/trl/blob/baa8f09cb35057c03b33d898f90c7e8ff958ed9b/trl/trainer/ppo_trainer.py#L468
@lvwerra could you comment more on the above with respect to:
Instead of initializing two UNets (teacher and student), it leverages the disable_adapters() and enable_adapters() functions. In this case, the teacher is without any adapters and the student is with LoRA. We only update the LoRA params in the student.
I think @younesbelkada would be the right person to answer this :)
@BenjaminBossan the main culprit was not putting torch.no_grad() where it was needed. I added that in 539bda3.
Now, things are slowly showing progress :-)
the main culprit was not putting torch.no_grad()
Awesome that you figured it out 🎉
Need to propagate the changes from #6145 once it's merged.
@dg845 as well if you want to give this a look :-)
# From LCMScheduler.get_scalings_for_boundary_condition_discrete
def scalings_for_boundary_conditions(timestep, sigma_data=0.5, timestep_scaling=10.0):
    c_skip = sigma_data**2 / ((timestep / 0.1) ** 2 + sigma_data**2)
    c_out = (timestep / 0.1) / ((timestep / 0.1) ** 2 + sigma_data**2) ** 0.5
    return c_skip, c_out
We could make this a little more general by using the timestep_scaling argument (and perhaps expose this as an argument):
def scalings_for_boundary_conditions(timestep, sigma_data=0.5, timestep_scaling=10.0):
    scaled_timestep = timestep_scaling * timestep
    c_skip = sigma_data**2 / (scaled_timestep**2 + sigma_data**2)
    c_out = scaled_timestep / (scaled_timestep**2 + sigma_data**2) ** 0.5
    return c_skip, c_out
(The current function is the same as in examples/consistency_distillation/train_lcm_distill_lora_sdxl_wds.py, so if we make this change we should probably also propagate it to the WebDataset scripts.)
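For what it's worth, with the default timestep_scaling=10.0 the two versions agree numerically, since timestep / 0.1 equals 10.0 * timestep; a standalone sanity check:
import math

def scalings_current(timestep, sigma_data=0.5):
    c_skip = sigma_data**2 / ((timestep / 0.1) ** 2 + sigma_data**2)
    c_out = (timestep / 0.1) / ((timestep / 0.1) ** 2 + sigma_data**2) ** 0.5
    return c_skip, c_out

def scalings_proposed(timestep, sigma_data=0.5, timestep_scaling=10.0):
    scaled_timestep = timestep_scaling * timestep
    c_skip = sigma_data**2 / (scaled_timestep**2 + sigma_data**2)
    c_out = scaled_timestep / (scaled_timestep**2 + sigma_data**2) ** 0.5
    return c_skip, c_out

for t in (1.0, 250.0, 999.0):
    for current, proposed in zip(scalings_current(t), scalings_proposed(t)):
        assert math.isclose(current, proposed, rel_tol=1e-9)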
Would prefer reflecting that after it's changed in WDS so that it's easier to track.
+1, let's adapt this for both scripts.
# 9. Add LoRA to the student U-Net, only the LoRA projection matrix will be updated by the optimizer.
lora_config = LoraConfig(
    r=args.lora_rank,
    lora_alpha=args.lora_rank,
Would it make sense to allow lora_alpha to be set independently of r/lora_rank, to allow the LoRA layer scaling to be controlled?
Maybe in a future PR, since this PR is almost just a copy-paste of the WDS version.
makes sense.
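If this gets picked up later, the change would roughly look like the following (the --lora_alpha argument is hypothetical and the parser, args, and LoraConfig come from the script; target_modules is truncated for brevity):
# Hypothetical: expose lora_alpha separately instead of reusing args.lora_rank.
parser.add_argument(
    "--lora_alpha", type=int, default=64, help="LoRA scaling numerator (scale = lora_alpha / r)."
)

lora_config = LoraConfig(
    r=args.lora_rank,
    lora_alpha=args.lora_alpha,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # truncated for brevity
)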
@patil-suraj @dg845 I ran a recent experiment with the following:
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="lora-lcm-sdxl-new"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"
export VAE_PATH="madebyollin/sdxl-vae-fp16-fix"
CUDA_VISIBLE_DEVICES=1 accelerate launch train_lcm_distill_lora_sdxl.py \
--pretrained_teacher_model=${MODEL_NAME} \
--pretrained_vae_model_name_or_path=${VAE_PATH} \
--output_dir="lora-lcm-sdxl-new" \
--mixed_precision="fp16" \
--dataset_name=$DATASET_NAME \
--resolution=1024 \
--train_batch_size=24 \
--gradient_accumulation_steps=1 \
--gradient_checkpointing \
--use_8bit_adam \
--lora_rank=64 \
--learning_rate=1e-4 \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=10000 \
--checkpointing_steps=3000 \
--validation_steps=50 \
--seed="0" \
--report_to="wandb" \
--push_to_hub
WandB: https://wandb.ai/sayakpaul/text2image-fine-tune/runs/tv3zw00t
Weights: https://huggingface.co/sayakpaul/pokemons-lora-lcm-sdxl
Given it's only 10k steps and I haven't ablated the hyperparameters, I'd say it's still good enough. WDYT? I'd be keen on merging this soon as it reduces the memory requirements significantly by following good PEFT practices.
WDYT?
Sounds good! Usually, in my experiments, 2-3k steps were enough; 10k results in overfitting with a large BS.
@patil-suraj I added the following as the reference training command for training on a very small dataset like Pokemons:
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"
export VAE_PATH="madebyollin/sdxl-vae-fp16-fix"
accelerate launch train_lcm_distill_lora_sdxl.py \
--pretrained_teacher_model=${MODEL_NAME} \
--pretrained_vae_model_name_or_path=${VAE_PATH} \
--output_dir="pokemons-lora-lcm-sdxl" \
--mixed_precision="fp16" \
--dataset_name=$DATASET_NAME \
--resolution=1024 \
--train_batch_size=24 \
--gradient_accumulation_steps=1 \
--gradient_checkpointing \
--use_8bit_adam \
--lora_rank=64 \
--learning_rate=1e-4 \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=3000 \
--checkpointing_steps=500 \
--validation_steps=50 \
--seed="0" \
--report_to="wandb" \
--push_to_hub
This has 3k steps instead of 10k. The training dynamics for this aren't clear but https://wandb.ai/sayakpaul/text2image-fine-tune/runs/tv3zw00t definitely shows progress IMO.
Will merge after the CI is green.
What does this PR do?
Add a datasets-compatible variant of https://github.com/huggingface/diffusers/blob/main/examples/consistency_distillation/train_lcm_distill_lora_sdxl_wds.py. It also adapts a couple of best practices from peft:
Instead of initializing two UNets (teacher and student), it leverages the disable_adapters() and enable_adapters() functions. In this case, the teacher is without any adapters and the student is with LoRA. We only update the LoRA params in the student.
It relies on the peft utility modules, positioning peft as a utility library rather than a modelling library.
Running a couple of experiments. Will report back the findings. But it should be more or less ready to be reviewed.
Basic training command
TODOs