Looks good so far, let me know if you need help testing / debugging
It works but needs hyperparameter tuning. If you could take it for a test run, that would be great!
Thanks a lot for working on this. Just left some comments. Most importantly:
Let's make sure we can load the saved lora with the pipeline for inference. I had some issues with it when I tried.
Everything else looks good!
# 8. Add LoRA to the student U-Net, only the LoRA projection matrix will be updated by the optimizer.
lora_config = LoraConfig(
    r=args.lora_rank,
    target_modules=[
        "to_q",
        "to_k",
        "to_v",
        "to_out.0",
        "proj_in",
        "proj_out",
        "ff.net.0.proj",
        "ff.net.2",
        "conv1",
        "conv2",
        "conv_shortcut",
        "downsamplers.0.conv",
        "upsamplers.0.conv",
        "time_emb_proj",
for later:
We could also think of making this an argument.
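A rough sketch of what that could look like (the --lora_target_modules flag, its default value, and the parser shown here are hypothetical, not part of the script):

import argparse

from peft import LoraConfig

parser = argparse.ArgumentParser()
parser.add_argument("--lora_rank", type=int, default=64)
# Hypothetical flag: let users override which modules get LoRA layers.
parser.add_argument(
    "--lora_target_modules",
    type=str,
    default=(
        "to_q,to_k,to_v,to_out.0,proj_in,proj_out,ff.net.0.proj,ff.net.2,"
        "conv1,conv2,conv_shortcut,downsamplers.0.conv,upsamplers.0.conv,time_emb_proj"
    ),
    help="Comma-separated list of module names to wrap with LoRA.",
)
args = parser.parse_args()

lora_config = LoraConfig(r=args.lora_rank, target_modules=args.lora_target_modules.split(","))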
Let's make sure we can load the saved lora with the pipeline for inference. I had some issues with it when I tried.
Could you post the snippet with which you tried and the error you got?
When I try to load the LoRA saved with the script using
pipe.load_lora_weights(path_to_saved_lora, weight_name="pytorch_lora_weights.safetensors")
I get
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In [13], line 1
----> 1 pipe.load_lora_weights("valhalla/lcm-sdxl-distill-lora-k5", weight_name="pytorch_lora_weights.safetensors")
File ~/diffusers/src/diffusers/loaders.py:3233, in StableDiffusionXLLoraLoaderMixin.load_lora_weights(self, pretrained_model_name_or_path_or_dict, adapter_name, **kwargs)
3230 if not is_correct_format:
3231 raise ValueError("Invalid LoRA checkpoint.")
-> 3233 self.load_lora_into_unet(
3234 state_dict, network_alphas=network_alphas, unet=self.unet, adapter_name=adapter_name, _pipeline=self
3235 )
3236 text_encoder_state_dict = {k: v for k, v in state_dict.items() if "text_encoder." in k}
3237 if len(text_encoder_state_dict) > 0:
File ~/diffusers/src/diffusers/loaders.py:1630, in LoraLoaderMixin.load_lora_into_unet(cls, state_dict, network_alphas, unet, low_cpu_mem_usage, adapter_name, _pipeline)
1627 if "lora_B" in key:
1628 rank[key] = val.shape[1]
-> 1630 lora_config_kwargs = get_peft_kwargs(rank, network_alphas, state_dict, is_unet=True)
1631 lora_config = LoraConfig(**lora_config_kwargs)
1633 # adapter_name
File ~/diffusers/src/diffusers/utils/peft_utils.py:122, in get_peft_kwargs(rank_dict, network_alpha_dict, peft_state_dict, is_unet)
120 rank_pattern = {}
121 alpha_pattern = {}
--> 122 r = lora_alpha = list(rank_dict.values())[0]
124 if len(set(rank_dict.values())) > 1:
125 # get the rank occuring the most number of times
126 r = collections.Counter(rank_dict.values()).most_common()[0][0]
@patil-suraj here's a summary of the changes: the teacher prediction is now computed under the disable_adapter() context (cc: @younesbelkada).
The following should work now:
from diffusers import StableDiffusionXLPipeline
import torch
pipeline_id = "stabilityai/stable-diffusion-xl-base-1.0"
ckpt_id = "sayakpaul/lora-lcm-sdxl-new"
pipeline = StableDiffusionXLPipeline.from_pretrained(pipeline_id, torch_dtype=torch.float16).to("cuda")
pipeline.load_lora_weights(ckpt_id, use_auth_token=True)
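After loading, inference would look roughly like this (just a sketch: swapping in LCMScheduler with ~4 steps and guidance_scale=1.0 is the usual LCM-LoRA recipe, not something verified in this PR; the prompt and output filename are placeholders):
from diffusers import LCMScheduler

# Swap the scheduler for the LCM one and sample with very few steps.
pipeline.scheduler = LCMScheduler.from_config(pipeline.scheduler.config)
image = pipeline(
    "a close-up picture of an old man standing in the rain",
    num_inference_steps=4,
    guidance_scale=1.0,
).images[0]
image.save("lcm_lora_sample.png")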
We're now serializing the final checkpoint in the diffusers-native format.
I'm currently running an experiment with the following command:
CUDA_VISIBLE_DEVICES=1 accelerate launch train_lcm_distill_lora_sdxl.py \
--pretrained_teacher_model=${MODEL_NAME} \
--pretrained_vae_model_name_or_path=${VAE_PATH} \
--output_dir="lora-lcm-sdxl-new" \
--mixed_precision="fp16" \
--dataset_name=$DATASET_NAME \
--resolution=1024 \
--train_batch_size=16 \
--gradient_accumulation_steps=1 \
--gradient_checkpointing \
--use_8bit_adam \
--learning_rate=1e-6 --loss_type="huber" \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=5000 \
--checkpointing_steps=100 \
--validation_steps=50 \
--seed="0" \
--report_to="wandb" \
--push_to_hub
It'd be very nice to have your 👀 on this.
cc @patil-suraj
Looking very good, just left some suggestions. We should remove the empty prompt embeds options here; since we train the model with CFG, it's not necessary. Apart from this, it's in a pretty good state already.
if env_local_rank != -1 and env_local_rank != args.local_rank:
    args.local_rank = env_local_rank

if args.proportion_empty_prompts < 0 or args.proportion_empty_prompts > 1:
    raise ValueError("`--proportion_empty_prompts` must be in the range [0, 1].")
This should be removed.
# create custom saving & loading hooks so that `accelerator.save_state(...)` serializes in a nice format
def save_model_hook(models, weights, output_dir):
    if accelerator.is_main_process:
        unet_ = accelerator.unwrap_model(unet)
        # save weights in peft format to be able to load them back
        unet_.save_pretrained(output_dir)
We should save the intermediate checkpoints in a loadable format as well, since intermediate checkpoints are sometimes better than the final one. That way it's convenient to just load a checkpoint and run inference with it, without having to manually convert the peft state dict.
WDYT about saving both in the peft format (as is currently being done for the intermediate checkpoints) and also in the diffusers format?
This way, the peft checkpoints can be loaded via load_adapter(), and the diffusers-format ones via .load_lora_weights().

Sounds good!
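A minimal sketch of what that dual-format hook could look like (assuming get_peft_model_state_dict from peft and convert_state_dict_to_diffusers from diffusers.utils are available, and reusing accelerator/unet from the script; treat the exact calls as illustrative):
from diffusers import StableDiffusionXLPipeline
from diffusers.utils import convert_state_dict_to_diffusers
from peft import get_peft_model_state_dict

def save_model_hook(models, weights, output_dir):
    if accelerator.is_main_process:
        unet_ = accelerator.unwrap_model(unet)
        # peft format: can be loaded back for resuming training via load_adapter()
        unet_.save_pretrained(output_dir)
        # diffusers format: directly loadable for inference via pipe.load_lora_weights()
        lora_state_dict = convert_state_dict_to_diffusers(get_peft_model_state_dict(unet_))
        StableDiffusionXLPipeline.save_lora_weights(output_dir, unet_lora_layers=lora_state_dict)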
# Get teacher model prediction on noisy_latents and conditional embedding
# Notice that we're disabling the adapter layers within the `unet` and then it becomes a
# regular teacher. This way, we don't have to separately initialize a teacher UNet.
with torch.no_grad() and torch.autocast(
    str(accelerator.device), dtype=weight_dtype
) and unet.disable_adapter():
Since we are just using one unet now, this autocast might not be necessary. The unet is already prepared with accelerate, which wraps the model in autocast when doing mixed-precision training.
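i.e. the teacher pass could be run without the explicit autocast, something like this (just a sketch; noisy_latents, timesteps, prompt_embeds, and added_conditions are stand-ins for the script's variables):
# With the unet already prepared by accelerate, the teacher prediction could simply be:
with torch.no_grad(), unet.disable_adapter():
    cond_teacher_output = unet(
        noisy_latents,
        timesteps,
        encoder_hidden_states=prompt_embeds,
        added_cond_kwargs=added_conditions,
    ).sample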
Kept it because the unet that's being trained is also under autocast (taken from the original script).
Curious to know how are the results with smaller datasets :)
@DN6 we can't add pip install peft in the PR test workflow either, as that swaps the training backend to use peft, which is not fully supported for the other training examples.
I see the following as a potential workaround. In the workflow, we run all the example tests except for test_text_to_image_lcm_lora_sdxl without installing peft. Then we add a command to install peft and then just test test_text_to_image_lcm_lora_sdxl. Does this work for you?
Later, when the examples are fully equipped with peft, we can revisit it. LMK.
Update: With #5388, this might already be fixed (cc: @younesbelkada).
# regular teacher. This way, we don't have to separately initialize a teacher UNet.
using_cuda = "cuda" in str(accelerator.device)
with torch.no_grad() and torch.autocast(
    str(accelerator.device), dtype=weight_dtype if using_cuda else torch.bfloat16, enabled=using_cuda
@patil-suraj I am not super happy about how I am doing it here but it's the only way I found to make the test pass on a CPU. Let me know if you have better ideas.
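One alternative worth considering (only a sketch, not tested here; accelerator, weight_dtype, and unet as in the script): build the autocast context conditionally and fall back to a no-op on CPU, so the with-statement itself stays unchanged.
import contextlib

import torch

# Enter autocast only on CUDA; on CPU use a no-op context so the test can still run.
autocast_ctx = (
    torch.autocast(str(accelerator.device), dtype=weight_dtype)
    if "cuda" in str(accelerator.device)
    else contextlib.nullcontext()
)
with torch.no_grad(), autocast_ctx, unet.disable_adapter():
    ...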
@sayakpaul I can confirm all issues with respect to the failing example CI are fixed with #5388. Perhaps we can merge #5388 first, and I can work on saving the adapter config in a follow-up PR. What do you think?
I will have a look at #5388.
Hi @sayakpaul, thanks for your great efforts!
I left a few comments regarding the current state of the fine-tuning script. Right now the script deviates a bit from the PEFT integration in diffusers, as it utilises the PeftModel interface. Although it works fine, it might lead to some confusion (why load_adapter for inference, which modifies the model in place, and get_peft_model, which suddenly returns a PeftModel, for training?). IMO, when using the PEFT integration (in diffusers and/or transformers) we should see peft as a utility library rather than a modeling library.
I propose to slightly refactor the training logic, which should also work as expected, following #5388. If that's something we want to ship asap, that's totally fine and it can be re-worked in a follow-up PR; let me know if you need any help!
@younesbelkada I resolved all your comments. Could you take a look again? Thanks so much in advance! In particular, how I am disabling and enabling the adapters.
Tests will pass after #5388 is merged.
Also, @patil-suraj, I launched an experiment yesterday:
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="lora-lora-sdxl-new"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"
export VAE_PATH="madebyollin/sdxl-vae-fp16-fix"
CUDA_VISIBLE_DEVICES=1 accelerate launch train_lcm_distill_lora_sdxl.py \
--pretrained_teacher_model=${MODEL_NAME} \
--pretrained_vae_model_name_or_path=${VAE_PATH} \
--output_dir="lora-lcm-sdxl-new" \
--mixed_precision="fp16" \
--dataset_name=$DATASET_NAME \
--resolution=1024 \
--train_batch_size=16 \
--gradient_accumulation_steps=1 \
--gradient_checkpointing \
--use_8bit_adam \
--learning_rate=1e-5 --loss_type="huber" \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=5000 \
--checkpointing_steps=100 \
--validation_steps=50 \
--seed="0" \
--report_to="wandb" \
--push_to_hub
The results (https://wandb.ai/sayakpaul/text2image-fine-tune/runs/ahqu0wvp) are random noise. Anything crucial I am missing? Any usual suspects?
A couple of things I saw while testing:
I ran out of GPU memory while running the second evaluation. I verified my max batch size by running the script using --validation_steps=5 to verify that I could train, evaluate, and go back to training successfully. But when I launched the training run for real with evaluation and checkpointing every 100 steps, it crashed at the 200-step mark. I'm not sure what could be the reason for this. I decreased my batch size (slightly, 12 to 10) after that, and the same thing happened again.
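(For reference, a common mitigation in such training scripts is to explicitly free the temporary validation pipeline before resuming training; just a sketch, not a confirmed fix for the crash above, and it assumes the validation code builds a local pipeline object:)
import gc

import torch

# After the validation loop, drop the pipeline and release cached CUDA memory.
del pipeline
gc.collect()
torch.cuda.empty_cache()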
Didn't see this with the following (I am using a single GPU from the DGX):
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="lora-lora-sdxl-new"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"
export VAE_PATH="madebyollin/sdxl-vae-fp16-fix"
CUDA_VISIBLE_DEVICES=1 accelerate launch train_lcm_distill_lora_sdxl.py \
--pretrained_teacher_model=${MODEL_NAME} \
--pretrained_vae_model_name_or_path=${VAE_PATH} \
--output_dir="lora-lcm-sdxl-new" \
--mixed_precision="fp16" \
--dataset_name=$DATASET_NAME \
--resolution=1024 \
--train_batch_size=16 \
--gradient_accumulation_steps=1 \
--gradient_checkpointing \
--use_8bit_adam \
--lora_rank=16 \
--learning_rate=1e-5 --loss_type="huber" --adam_weight_decay=0.0 \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=5000 \
--checkpointing_steps=100 \
--validation_steps=50 \
--seed="0" \
--report_to="wandb" \
--push_to_hub
There are many images smaller than 1024 pixels in the pokemon-blip-captions dataset (didn't count how many). Would that be a problem for SDXL training?
Could be, but SDXL doesn't really excel at resolutions lower than that, from what I have seen. Currently, I don't have a good workaround other than suggesting the use of an upscaler to upsample the images to 1024x1024.
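(For reference, a quick way to count how many images fall below 1024px; a standalone sketch that assumes the dataset's image column is named "image":)
from datasets import load_dataset

ds = load_dataset("lambdalabs/pokemon-blip-captions", split="train")
# Count images whose shorter side is below the 1024px training resolution.
small = sum(1 for example in ds if min(example["image"].size) < 1024)
print(f"{small} / {len(ds)} images are smaller than 1024px on their shorter side")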
Impressive work @sayakpaul! All good on the PEFT side!
What about other files with the same problem?
More info:
huan085128/lcm_lora#1
https://github.com/huggingface/diffusers/blob/main/examples/consistency_distillation/train_lcm_distill_lora_sdxl_wds.py#L506C15-L506C15
@ArturFormella this PR should provide a good reference for anyone willing to adjust the scripts accordingly. Plus @pcuenca is working on #5908.
@BenjaminBossan pinging you since Sourab is busy and Younes is OOO.
In this PR, I am doing what I described here: #5778 (comment). The training, however, seems to be completely off, as it's yielding just noise: https://wandb.ai/sayakpaul/text2image-fine-tune/runs/ax6w9q0x?workspace=user-sayakpaul. I don't think it's just a matter of hyperparameters.
I suspect some peft-related changes that I did in this PR might have caused it. I am unable to see any potential suspects in comparison to what we have in https://github.com/huggingface/diffusers/blob/main/examples/consistency_distillation/train_lcm_distill_lora_sdxl_wds.py (don't worry about the dataloading code). Would you be able to take a look and advise regarding the peft-related changes?
Hi @sayakpaul I took a look at the script and also compared it to the other one you linked. I have no experience in training diffusion models, so I cannot judge if the generated output could be solved by some better hyper-parameters or if something else is completely off. The diff between the two scripts is quite large, so I could not pinpoint what difference causes the trouble.
The main difference that involves PEFT seems to be the way the teacher network is being used via disable_adapters(). IIUC, the teacher net is completely frozen, whereas the student net is teacher + LoRA parameters on top, with only the latter being updated. Maybe something worth checking would be whether requires_grad is set correctly on all params after disable_adapters() is called, and whether it is re-set correctly after enable_adapters() is called. The diffusers implementation of these methods looks correct to me, but if the training issues are PEFT-related, this is what I would check first.
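A quick way to run that check (a sketch against the script's unet; it just counts trainable parameters before and after toggling the adapters):
def count_trainable(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print("before:", count_trainable(unet))
unet.disable_adapters()
print("after disable_adapters:", count_trainable(unet))
unet.enable_adapters()
print("after enable_adapters:", count_trainable(unet))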
Thanks @BenjaminBossan for your inputs. I will get to them and let you know.
@BenjaminBossan sorry for the delay here. Big log file here: https://huggingface.co/datasets/sayakpaul/sample-datasets/blob/main/logs.txt.
Some findings:
params_to_optimize is defined here: https://github.com/huggingface/diffusers/blob/8fecdda23cefc30053a511db597b9c017b8fcaa1/examples/consistency_distillation/train_lcm_distill_lora_sdxl.py#L864C1-L865C64. Somewhat surprisingly, the length of params_to_optimize (when packed as a list) is zero. This is not expected, no? However, params_to_optimize_named prints all the trainable params correctly.
print(len(list(params_to_optimize)), len(list(params_to_optimize_after_enable))) gives 0 1576, respectively. I think 1576 is expected and should have been the case for params_to_optimize as well.

Somewhat surprisingly, the length of params_to_optimize (when packed as a list) is zero. This is not expected, no?
The issue here is that the result of filter is lazy, like a generator expression. Since you pass the same expression to the optimizer right after, it is exhausted, so when you call list on it after that, it is empty. Almost certainly, it would contain the same number if you cast it to a list before passing it to the optimizer.
The other numbers are, as you say, pretty much what's expected, so my suspicion that the issue could be related to requires_grad now seems very unlikely.
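A tiny standalone illustration of that pitfall:
params = [1, 2, 3]
trainable = filter(lambda p: p > 1, params)

first_pass = list(trainable)   # [2, 3]: this consumes the filter object
second_pass = list(trainable)  # []: already exhausted, just like after handing it to the optimizer
print(first_pass, second_pass)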
Instead of initializing two UNets (teacher and student), it leverages the disable_adapters() and enable_adapters() functions. In this case, the teacher is without any adapters and the student is with LoRA. We only update the LoRA params in the student.
Do you have a reference for using this trick for training? I wonder if there could be some kind of flaw in that approach which could explain the findings.
Do you have a reference for using this trick for training? I wonder if there could be some kind of flaw in that approach which could explain the findings.
This is taken from https://huggingface.co/blog/stackllama. It's implemented in TRL too:
https://github.com/huggingface/trl/blob/baa8f09cb35057c03b33d898f90c7e8ff958ed9b/trl/trainer/ppo_trainer.py#L468
@lvwerra could you comment more on the above with respect to:
Instead of initializing two UNets (teacher and student), it leverages the disable_adapters() and enable_adapters() functions. In this case, the teacher is without any adapters and the student is with LoRA. We only update the LoRA params in the student.
I think @younesbelkada would be the right person to answer this :)
@BenjaminBossan the main culprit was not putting torch.no_grad() where it was needed. I added that in 539bda3.
Now, things are slowly showing progress :-)
the main culprit was not putting torch.no_grad()
Awesome that you figured it out 🎉
Need to propagate the changes from #6145 once it's merged.
@dg845 as well if you want to give this a look :-)
# From LCMScheduler.get_scalings_for_boundary_condition_discrete
def scalings_for_boundary_conditions(timestep, sigma_data=0.5, timestep_scaling=10.0):
    c_skip = sigma_data**2 / ((timestep / 0.1) ** 2 + sigma_data**2)
    c_out = (timestep / 0.1) / ((timestep / 0.1) ** 2 + sigma_data**2) ** 0.5
    return c_skip, c_out
We could make this a little more general by using the timestep_scaling argument (and perhaps expose this as an argument):
def scalings_for_boundary_conditions(timestep, sigma_data=0.5, timestep_scaling=10.0):
    scaled_timestep = timestep_scaling * timestep
    c_skip = sigma_data**2 / (scaled_timestep**2 + sigma_data**2)
    c_out = scaled_timestep / (scaled_timestep**2 + sigma_data**2) ** 0.5
    return c_skip, c_out
(The current function is the same as in examples/consistency_distillation/train_lcm_distill_lora_sdxl_wds.py, so if we make this change we should probably also propagate it to the WebDataset scripts.)
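For what it's worth, with the default timestep_scaling=10.0 the two versions agree numerically, since timestep / 0.1 equals 10.0 * timestep; a standalone sanity check:
import math

def scalings_current(timestep, sigma_data=0.5):
    c_skip = sigma_data**2 / ((timestep / 0.1) ** 2 + sigma_data**2)
    c_out = (timestep / 0.1) / ((timestep / 0.1) ** 2 + sigma_data**2) ** 0.5
    return c_skip, c_out

def scalings_proposed(timestep, sigma_data=0.5, timestep_scaling=10.0):
    scaled_timestep = timestep_scaling * timestep
    c_skip = sigma_data**2 / (scaled_timestep**2 + sigma_data**2)
    c_out = scaled_timestep / (scaled_timestep**2 + sigma_data**2) ** 0.5
    return c_skip, c_out

for t in (1.0, 250.0, 999.0):
    for current, proposed in zip(scalings_current(t), scalings_proposed(t)):
        assert math.isclose(current, proposed, rel_tol=1e-9)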
Would prefer reflecting that after it's changed in WDS so that it's easier to track.
+1, let's adapt this for both scripts.
# 9. Add LoRA to the student U-Net, only the LoRA projection matrix will be updated by the optimizer.
lora_config = LoraConfig(
    r=args.lora_rank,
    lora_alpha=args.lora_rank,
Would it make sense to allow lora_alpha to be set independently of r/lora_rank, to allow the LoRA layer scaling to be controlled?
Maybe in a future PR, since this PR is almost just a copy-paste of the WDS version.
makes sense.
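If this gets picked up later, the change would roughly look like the following (the --lora_alpha argument is hypothetical and the parser, args, and LoraConfig come from the script; target_modules is truncated for brevity):
# Hypothetical: expose lora_alpha separately instead of reusing args.lora_rank.
parser.add_argument(
    "--lora_alpha", type=int, default=64, help="LoRA scaling numerator (scale = lora_alpha / r)."
)

lora_config = LoraConfig(
    r=args.lora_rank,
    lora_alpha=args.lora_alpha,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # truncated for brevity
)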
@patil-suraj @dg845 I ran a recent experiment with the following:
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="lora-lcm-sdxl-new"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"
export VAE_PATH="madebyollin/sdxl-vae-fp16-fix"
CUDA_VISIBLE_DEVICES=1 accelerate launch train_lcm_distill_lora_sdxl.py \
--pretrained_teacher_model=${MODEL_NAME} \
--pretrained_vae_model_name_or_path=${VAE_PATH} \
--output_dir="lora-lcm-sdxl-new" \
--mixed_precision="fp16" \
--dataset_name=$DATASET_NAME \
--resolution=1024 \
--train_batch_size=24 \
--gradient_accumulation_steps=1 \
--gradient_checkpointing \
--use_8bit_adam \
--lora_rank=64 \
--learning_rate=1e-4 \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=10000 \
--checkpointing_steps=3000 \
--validation_steps=50 \
--seed="0" \
--report_to="wandb" \
--push_to_hub
WandB: https://wandb.ai/sayakpaul/text2image-fine-tune/runs/tv3zw00t
Weights: https://huggingface.co/sayakpaul/pokemons-lora-lcm-sdxl
Given it's only 10k steps and I haven't ablated the hyperparameters, I'd say it's still good enough. WDYT? I'd be keen on merging this soon as it reduces the memory requirements significantly by following good PEFT practices.
WDYT?
Sounds good! Usually, in my experiments, 2-3k steps were enough; 10k results in overfitting with a large BS.
@patil-suraj I added the following as the reference training command for training on a very small dataset like Pokemons:
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"
export VAE_PATH="madebyollin/sdxl-vae-fp16-fix"
accelerate launch train_lcm_distill_lora_sdxl.py \
--pretrained_teacher_model=${MODEL_NAME} \
--pretrained_vae_model_name_or_path=${VAE_PATH} \
--output_dir="pokemons-lora-lcm-sdxl" \
--mixed_precision="fp16" \
--dataset_name=$DATASET_NAME \
--resolution=1024 \
--train_batch_size=24 \
--gradient_accumulation_steps=1 \
--gradient_checkpointing \
--use_8bit_adam \
--lora_rank=64 \
--learning_rate=1e-4 \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=3000 \
--checkpointing_steps=500 \
--validation_steps=50 \
--seed="0" \
--report_to="wandb" \
--push_to_hub
This has 3k steps instead of 10k. The training dynamics for this aren't clear but https://wandb.ai/sayakpaul/text2image-fine-tune/runs/tv3zw00t definitely shows progress IMO.
Will merge after the CI is green.
What does this PR do?
Add a datasets-compatible variant of https://github.com/huggingface/diffusers/blob/main/examples/consistency_distillation/train_lcm_distill_lora_sdxl_wds.py. It also adapts a couple of best practices from peft:
Instead of initializing two UNets (teacher and student), it leverages the disable_adapters() and enable_adapters() functions. In this case, the teacher is without any adapters and the student is with LoRA. We only update the LoRA params in the student.
It relies on the peft utility modules, positioning peft as a utility library rather than a modelling library.
Running a couple of experiments. Will report back the findings. But it should be more or less ready to be reviewed.
Basic training command
TODOs