@haofanwang @wangqixun
would you be willing to give this a review if you have time?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Can we have some sample training results (such as images) from this script attached in the doc, or anywhere suitable?
Here are some training results from the lineart ControlNet.
First trained at 512 resolution and then fine-tuned at 1024 resolution.
> * `report_to="tensorboard"` will ensure the training runs are tracked on TensorBoard.
> * `validation_image`, `validation_prompt`, and `validation_steps` to allow the script to do a few validation inference runs. This allows us to qualitatively check if the training is progressing as expected.
>
> Our experiments were conducted on a single 40GB A100 GPU.
Wow, 40GB A100 seems doable.
Sorry, this is the 80GB A100 (I wrote it wrong). I did a lot of extra work to get it to train with DeepSpeed ZeRO-3 on the 40GB A100, but I don't think this is suitable for everyone.
Not at all. I think it would still be nice to include the changes you had to make in the form of notes in the README. Does that work?
I'll see if I can add it later.
@sayakpaul We added a tutorial on configuring DeepSpeed in the README.
There are some tricks to lower GPU memory usage:
With 1, 2, and 3, can this be trained under 40GB?
In my experience, DeepSpeed ZeRO-3 must be used. @linjiapro, your settings will cost about 70GB at 1024 resolution with batch size 1, or at 512 with batch size 3.
Sorry to bother you, but have you ever tried caching the text-encoder and VAE latents to train with less GPU memory? @PromeAIpro @linjiapro
Text-encoder caching is already available in this script (saving about 10GB of GPU memory on T5). For caching the VAE latents, you can check the DeepSpeed section in the README, which includes VAE caching.
FYI, you can also reduce memory usage by using `optimum-quanto` and qint8-quantising all of the modules except the ControlNet (not activation quantisation, just the weights). I ran some experiments on this with my own ControlNet training script and it seems to work just fine.
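For reference, a minimal sketch of that weight-only quantisation, assuming the frozen modules are already loaded under names like `flux_transformer`, `vae`, `text_encoder_one`, and `text_encoder_two` (these names are illustrative, not the script's actual variables):

```python
from optimum.quanto import freeze, qint8, quantize

# Weight-only qint8 quantisation: weights are stored in int8 and de-quantised
# on the fly; activations are left untouched. The ControlNet itself is skipped
# so it keeps training in full precision.
for module in (flux_transformer, vae, text_encoder_one, text_encoder_two):
    quantize(module, weights=qint8)
    freeze(module)
```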
Hi, thanks for your PR. I just left some initial comments. LMK what you think.
Thanks! Appreciate your hard work here. Left some more comments.
Can we fix the code quality issues? `make quality && make style`?
Thank you! Left some more comments. Let me know if they make sense or are unclear.
Left some additional minor comments but I see existing comments are yet to be addressed. Let me know when you would like another round of review.
@sayakpaul hey, I think I have fixed all the issues, time to start a new review.
```python
bsz = pixel_latents.shape[0]
noise = torch.randn_like(pixel_latents).to(accelerator.device).to(dtype=weight_dtype)
# Sample a random timestep for each image
# for weighting schemes where we sample timesteps non-uniformly
u = compute_density_for_timestep_sampling(
    weighting_scheme=args.weighting_scheme,
    batch_size=bsz,
    logit_mean=args.logit_mean,
    logit_std=args.logit_std,
    mode_scale=args.mode_scale,
)
indices = (u * noise_scheduler_copy.config.num_train_timesteps).long()
timesteps = noise_scheduler_copy.timesteps[indices].to(device=pixel_latents.device)

# Add noise according to flow matching.
sigmas = get_sigmas(timesteps, n_dim=pixel_latents.ndim, dtype=pixel_latents.dtype)
noisy_model_input = (1.0 - sigmas) * pixel_latents + sigmas * noise
```
I thought we were using a different timestep sampling procedure and I suggested to have that as a default. Are we not doing that anymore?
Do you mean setting the original sampling scheme as the default? For the weighting scheme I just copied from here.
Yeah, I meant to keep the sigmoid sampling as your default and let users configure it as we do in the other scripts.
Okay. But it depends on a std and mean. IIRC your scheme did `torch.randn()` and applied sigmoid, right?
Yes, this used `torch.randn()` at first, but given the examples you provided, I think this is maybe a better solution for us?
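For context, a rough sketch of the two timestep-sampling schemes being compared (variable names are illustrative; the second branch mirrors what the `logit_normal` weighting scheme of `compute_density_for_timestep_sampling` does):

```python
import torch

bsz, num_train_timesteps = 4, 1000

# (a) the original scheme in this script: sigmoid of a standard normal
u_sigmoid = torch.sigmoid(torch.randn(bsz))

# (b) the "logit_normal" weighting scheme: the same idea, but with a
#     configurable logit_mean / logit_std
logit_mean, logit_std = 0.0, 1.0
u_logit_normal = torch.sigmoid(torch.randn(bsz) * logit_std + logit_mean)

# either way, u in (0, 1) is then mapped to discrete training timesteps
indices = (u_logit_normal * num_train_timesteps).long()
```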
Left some comments, but here are my concerns:
LMK if anything is unclear.
@PromeAIpro we didn't have to close this PR. Is there anything we could do to revive it? We would very much like to do that. Please let us know.
Sorry, I did it by mistake.
Thanks. I think this is looking good. Some minor comments.
Also, we would need to add tests like in https://github.com/huggingface/diffusers/blob/main/examples/controlnet/test_controlnet.py.
@yiyixuxu could you review the changes made to the ControlNet pipeline?
Added a test in `test_controlnet.py`.
@yiyixuxu could you review the changes made to the ControlNet Flux pipeline once you have a moment?
@PromeAIpro Hi, great work! Can this also train on FLUX Schnell, or only dev right now?
Training on Schnell seems to work but I had to set `guidance=None` during the forward pass.
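A minimal sketch of that guidance handling inside the training loop, assuming variables like `flux_transformer`, `bsz`, `accelerator`, `weight_dtype`, and an `args.guidance_scale` flag (these names are assumptions, not necessarily the script's):

```python
import torch

# FLUX.1-dev is guidance-distilled and expects a guidance tensor;
# FLUX.1-schnell (and tiny test checkpoints) have no guidance embedding,
# so we pass guidance=None during the forward pass.
if getattr(flux_transformer.config, "guidance_embeds", False):
    guidance = torch.full(
        (bsz,), args.guidance_scale, device=accelerator.device, dtype=weight_dtype
    )
else:
    guidance = None
```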
Excellent job, but I have a question. I tried the script with 512 resolution, bf16, and batch size 1, and it uses 76GB of memory on an A800 (80GB). And 1024 resolution cannot be trained because of the memory. Any suggestions?
with the following settings:

accelerate config:

```yaml
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: '1'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

training script:

```bash
accelerate launch --main_process_port 29511 --config_file acc_config_singlegpu.yaml train_controlnet_flux.py \
  --pretrained_model_name_or_path="/home/export/base/ycsc_yaosy/yaosy/online1/models/black-forest-labs/FLUX.1-dev" \
  --jsonl_for_train="./controlnet_sdxl_train_5examples.jsonl" \
  --conditioning_image_column=conditioning_image \
  --image_column=image \
  --caption_column=text \
  --output_dir="./controlnet_example_512" \
  --mixed_precision="bf16" \
  --resolution=512 \
  --learning_rate=1e-5 \
  --max_train_steps=15000 \
  --validation_steps=5 \
  --checkpointing_steps=200 \
  --validation_image "test.jpg" \
  --validation_prompt "..." \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --report_to="tensorboard" \
  --num_double_layers=4 \
  --num_single_layers=0 \
  --seed=42
```
@ShunyuYao try using `--use_adafactor` as the optimizer, maybe? Also, with the latest code you can use `--enable_model_cpu_offload` to run at 1024 resolution with AdamW.
Here are my settings (they use about 66GB for training). Please delete the `#` comments when you use them:
```bash
CUDA_VISIBLE_DEVICES=0 python ../train_controlnet_flux.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --dataset_name=fusing/fill50k \
  --max_train_samples=100 \
  --conditioning_image_column=conditioning_image \
  --image_column=image \
  --caption_column=text \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="bf16" \
  --resolution=1024 \
  --learning_rate=1e-5 \
  --max_train_steps=10 \
  --checkpointing_steps=11 \
  --validation_steps=1 \
  --validation_image "./conditioning_image_1.png" \
  --validation_prompt "red circle with blue background" \
  --num_validation_images=1 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=2 \
  --report_to="wandb" \
  --num_double_layers=4 \
  --num_single_layers=0 \
  --seed=42 \
  --save_weight_dtype="bf16" \
  --push_to_hub \
  --enable_model_cpu_offload \ # will cause slower training
  --use_adafactor \ # save 10g memory
```
@ShunyuYao I would try to precompute the text embeddings (and maybe the VAE outputs too) if possible. Those will save you a few gigabytes.
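As an illustration only, here is one way such precomputation could look; the model path, max lengths, and helper name below are assumptions rather than options the script exposes:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5TokenizerFast


@torch.no_grad()
def precompute_prompt_embeds(prompts, model_path="black-forest-labs/FLUX.1-dev", device="cuda"):
    # Load both Flux text encoders once, encode every caption, then free them.
    tokenizer_one = CLIPTokenizer.from_pretrained(model_path, subfolder="tokenizer")
    text_encoder_one = CLIPTextModel.from_pretrained(model_path, subfolder="text_encoder").to(device)
    tokenizer_two = T5TokenizerFast.from_pretrained(model_path, subfolder="tokenizer_2")
    text_encoder_two = T5EncoderModel.from_pretrained(model_path, subfolder="text_encoder_2").to(device)

    cache = []
    for prompt in prompts:
        clip_ids = tokenizer_one(
            prompt, padding="max_length", max_length=77, truncation=True, return_tensors="pt"
        ).input_ids.to(device)
        pooled = text_encoder_one(clip_ids).pooler_output.cpu()

        t5_ids = tokenizer_two(
            prompt, padding="max_length", max_length=512, truncation=True, return_tensors="pt"
        ).input_ids.to(device)
        prompt_embeds = text_encoder_two(t5_ids)[0].cpu()

        cache.append({"prompt_embeds": prompt_embeds, "pooled_prompt_embeds": pooled})

    # Free the text encoders so the memory is available for training.
    del text_encoder_one, text_encoder_two
    torch.cuda.empty_cache()
    return cache
```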
```diff
             joint_attention_kwargs=self.joint_attention_kwargs,
             return_dict=False,
         )
+        # ensure dtype
```
why is this needed?
See the discussion in #9324 (comment) and https://github.com/huggingface/diffusers/pull/9324/files/32eb1ef4897332954f3f0e967ff165e09e341ed8#r1758447457.
We think that, rather than writing the conversion code in the training script, it is better to write it in the pipeline (right now it is written in both the training script and the pipeline). It is just a safeguard and has no effect on inference.
I looked at the comment; it still doesn't explain why this is needed.
We have no issue running inference with the available ControlNet checkpoint without this change.
That's right; the dtype conversion change in the pipeline is not closely related to the ControlNet training script. We found the dtype inconsistency issue while writing the training script. It doesn't happen during inference now, but we fixed it as a by-the-way. Maybe we should move this fix to a new issue, to be applied if a dtype inconsistency shows up in the future?
We are neutral on that; what do you think? @yiyixuxu @sayakpaul
Yes, a separate issue would be nice! And maybe a minimal reproducible script to help understand the issue.
This is because T5 does not support autocast (causing black images). However, during validation our ControlNet is fp32 and our transformer is bf16, so we need to explicitly convert the dtype in the pipeline.
Started a new issue: #9527.
But validation is only for logging outputs; why can't we run the ControlNet in bf16 too? Anyway, I think this change should not be in the pipelines for now :)
Yes. Is there a way for diffusers to clone the ControlNet? We are considering cloning a copy and converting it to bf16 for validation; if we directly convert the original weights, we will lose precision.
The fundamental solution is to fix the T5 autocast problem here (#9527).
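A minimal sketch of the cloning idea described above, assuming the training loop holds the fp32 ControlNet as `flux_controlnet` under an `accelerator` (a sketch of the idea, not the implementation that landed):

```python
import copy

import torch

# Keep the trained ControlNet in fp32; validate with a temporary bf16 copy
# so the original weights never lose precision.
controlnet_for_validation = copy.deepcopy(
    accelerator.unwrap_model(flux_controlnet)
).to(dtype=torch.bfloat16)

# ... run the validation pipeline with `controlnet_for_validation` ...

del controlnet_for_validation
torch.cuda.empty_cache()
```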
Okay something that would work is the following:
Would this work?
That works; the pipeline change is now removed.
Hi,
Just shamelessly plugging my ControlNet repo here which I just made public: https://github.com/christopher-beckham/flux-controlnet
Feel free to pick and choose things from the code if you think it could help with your PR. I have explained some of it in the README. While there is no public dataset associated with this repo, I have trained with qint8 quantisation + 8-bit ADAM on a fairly large internal dataset and gotten more or less decent images on a 40GB GPU.
Some of the tricks mentioned here may also be of use: https://github.com/bghira/SimpleTuner/blob/main/documentation/quickstart/FLUX.md
@christopher-beckham thanks for sharing your work! Looks very cool!
The purpose of the scripts within `examples` (at least the ones we officially maintain at the moment) is to provide bare-bones references. So, I think it's okay for the moment to skip the quantization-related bits and other things.
The simplest reasonable defaults that lead to okay results are fine, IMO. So, what we could do is mention other popular ControlNet trainers like yours in the README in case users want to take things further. I hope that works.
Just reviewed it!
I think it looks quite good, apart from @yiyixuxu's concerns here: #9324 (comment).
I would probably lean towards doing it from the training script because otherwise, it would add more maintenance. But I will let Yiyi comment further.
@PromeAIpro could you please follow the instructions from the CI and ensure the core quality checks pass?
Yes, I know. This is because T5 does not support autocast (causing black images). However, during validation our ControlNet is fp32 and our transformer is bf16, so we need to explicitly convert the dtype in the pipeline.
@sayakpaul I already ran `make style` and `make quality`. Is there anything else that needs to be done?
@sayakpaul not quite sure about this.
Can you try the following? Navigate to your local clone of `diffusers`, run `pip install -e ".[quality]"`, and then `make style && make quality`?

@sayakpaul That works! Please check it again.
@PromeAIpro Thanks for your advice, I tried different settings. I finally found that `--gradient_checkpointing` is quite useful for saving memory at 1024 resolution.
@vahidEttehadiAniml Sorry, I haven't tested multi-GPU much, but you can definitely follow the process I give in the README and train on a 40GB A100 with DeepSpeed and Accelerate.
Thank you so much for your hard work!
👀
I think one key factor to reduce the GPU load is the following:
--quantize: quantise everything (except ControlNet) into int8 via the optimum-quanto library. This is weight only quantisation, so params are stored in int8 and are de-quantised on the fly. You may be able to squeeze out even more savings with lower bits but this has not been tested.
Can we add the above option to this PR?
cc @PromeAIpro
@linjiapro: @sayakpaul envisioned the script as being more on the bare-bones side (see his reply above), though I would also argue that most people are not going to have access to (or not going to want to pay for) an 80GB GPU. Therefore, I would really argue for the addition of quantisation.
Edit: edited my original post, I originally mentioned 8-bit ADAM would be nice to have but I see that it's in :)
Edit no. 2: from the above discussion it looks like the ControlNet is being trained in fp32; however, it would be trivial to add an option to also train it in bf16, and I had no issues with it. And maybe you'd avoid the autocast issue altogether for the validation logging.
Oh, I missed sayakpaul's comments. I was hoping that quantisation would not change too much of the script; it is just a data format for the nets. I thought it would be a flag that, when turned on, casts the nets to int8. But it seems it is not that simple.
It is that simple, as that's what `optimum-quanto` is designed to do. As to what extent one loses out on sample quality during training, I'm not sure (one just quantises the entire backbone; you can keep the ControlNet in bf16). But in my own personal experience using it (with my repo) I never encountered any numerical instabilities, and sample quality was on par with what I expected from other ControlNets.
> Edit no. 2: from the above discussion it looks like the ControlNet is being trained in fp32; however, it would be trivial to add an option to also train it in bf16, and I had no issues with it. And maybe you'd avoid the autocast issue altogether for the validation logging.

@christopher-beckham thank you! WDYT about a follow-up PR to:

- Enable training and saving in BF16
- Add your repository in the README so that people can explore other ways

Would that work for you?
@PromeAIpro the test is failing:
https://github.com/huggingface/diffusers/actions/runs/11056617305/job/30718614922?pr=9324#step:9:356
We need to use a small checkpoint like `hf-internal-testing/tiny-flux-pipe`.
Ah, I see what is happening. First, we are using a large model (see https://github.com/huggingface/diffusers/actions/runs/11063243172/job/30739077215?pr=9324#step:9:268), which is too big for the CI. Can we please follow what the rest of the ControlNet tests do, i.e., use a tiny checkpoint?
Regarding the tokenizer, we still need to address the usage of small checkpoints.
BTW, how can I run this test, `test_controlnet_flux`?

`pytest examples/controlnet -k "test_controlnet_flux"`
But you're using "--controlnet_model_name_or_path=promeai/FLUX.1-controlnet-lineart-promeai" in the test.
We don't use a pre-trained ControlNet model in the tests. We initialize it from the denoiser. For SD and SDXL, we initialize it from the UNet. We need to do something similar here.
I tried using

```python
flux_controlnet = FluxControlNetModel.from_transformer(
    flux_transformer,
    num_layers=args.num_double_layers,
    num_single_layers=args.num_single_layers,
)
```

but got an error.
BTW, the tokenizer loading problem was fixed by removing `use_fast=False`:

```diff
 tokenizer_two = AutoTokenizer.from_pretrained(
     args.pretrained_model_name_or_path,
     subfolder="tokenizer_2",
     revision=args.revision,
-    use_fast=False,
 )
```
Thanks for fixing the tokenizer issue. Regarding initializing from the transformer, I think it's because we're using:

```
--num_double_layers=4
--num_single_layers=0
```

Could we try:

```
--num_double_layers=2
--num_single_layers=1
```
I explicitly pass them in, and it works:

```diff
 flux_controlnet = FluxControlNetModel.from_transformer(
     flux_transformer,
+    attention_head_dim=flux_transformer.config["attention_head_dim"],
+    num_attention_heads=flux_transformer.config["num_attention_heads"],
     num_layers=args.num_double_layers,
     num_single_layers=args.num_single_layers,
 )
```
I can replicate the error:
```python
from diffusers import FluxTransformer2DModel, FluxControlNetModel

transformer = FluxTransformer2DModel.from_pretrained(
    "hf-internal-testing/tiny-flux-pipe", subfolder="transformer"
)
controlnet = FluxControlNetModel.from_transformer(
    transformer=transformer, num_layers=1, num_single_layers=1, attention_head_dim=16, num_attention_heads=1
)
```
Leads to:
```
RuntimeError: Error(s) in loading state_dict for CombinedTimestepTextProjEmbeddings:
    size mismatch for timestep_embedder.linear_1.weight: copying a param with shape torch.Size([32, 256]) from checkpoint, the shape in current model is torch.Size([16, 256]).
    size mismatch for timestep_embedder.linear_1.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
    size mismatch for timestep_embedder.linear_2.weight: copying a param with shape torch.Size([32, 32]) from checkpoint, the shape in current model is torch.Size([16, 16]).
    size mismatch for timestep_embedder.linear_2.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
    size mismatch for text_embedder.linear_1.weight: copying a param with shape torch.Size([32, 32]) from checkpoint, the shape in current model is torch.Size([16, 32]).
    size mismatch for text_embedder.linear_1.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
    size mismatch for text_embedder.linear_2.weight: copying a param with shape torch.Size([32, 32]) from checkpoint, the shape in current model is torch.Size([16, 16]).
    size mismatch for text_embedder.linear_2.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
```
Opened an issue here: #9540.
@PromeAIpro could you make the changes accordingly then?
I have tested it on my own machine and it works correctly.
BTW, I added guidance handling for Flux transformers that don't use guidance, such as tiny-flux-pipe.
Looks good, please try it again!
```
$ pytest examples/controlnet -k "test_controlnet_flux"
===================================================================== test session starts ======================================================================
platform linux -- Python 3.10.14, pytest-8.3.3, pluggy-1.5.0
rootdir: /data3/home/srchen/test_diffusers/diffusers
configfile: pyproject.toml
collected 5 items / 4 deselected / 1 selected

examples/controlnet/test_controlnet.py .                                                                                                                 [100%]

=============================================================== 1 passed, 4 deselected in 25.87s ===============================================================
```
Thanks a lot for your contributions!
Thank you for your guidance in my work!!
> @christopher-beckham thank you! WDYT about a follow-up PR to:
>
> - Enable training and saving in BF16
> - Add your repository in the README so that people can explore other ways
>
> Would that work for you?
Thank you guys for your work! @sayakpaul does this reply indicate that BF16 is not currently supported? I saw in a slightly earlier comment that the example parameters provided by @PromeAIpro included `--mixed_precision="bf16"` and `--save_weight_dtype="bf16"`; what do they mean?

Also, I understand that your design idea is to provide only simple and effective basic functionality, but I also found in the SDXL ControlNet training script that there are some optimisation options such as `--gradient_checkpointing`, `--use_8bit_adam`, `--set_grads_to_none`, `--enable_xformers_memory_efficient_attention`, etc. Will similar performance optimisation options appear in this script later?

Thank you very much for your answers!
Here are some training results from the lineart ControlNet (images omitted):

| input | output | prompt |
| --- | --- | --- |
| (image) | (image) | cute anime girl with massive fluffy fennec ears and a big fluffy tail blonde messy long hair blue eyes wearing a maid outfit with a long black gold leaf pattern dress and a white apron mouth open holding a fancy black forest cake with candles on top in the kitchen of an old dark Victorian mansion lit by candlelight with a bright window to the foggy forest and very expensive stuff everywhere |
| (image) | (image) | a busy urban intersection during daytime. The sky is partly cloudy with a mix of blue and white clouds. There are multiple traffic lights, and vehicles are seen waiting at the red signals. Several businesses and shops are visible on the side, with signboards and advertits. The road is wide, and there are pedestrian crossings. Overall, it appears to be a typical day in a bustling city. |

First trained at 512 resolution and then fine-tuned at 1024 resolution.
Hello, where can I find the dataset for training the ControlNet? Thanks.
What does this PR do?
In this PR we add a Flux ControlNet training script to `examples`, tested on an A100-SXM4-80GB.

Using this training script, we can customize the number of transformer layers by setting `--num_double_layers=4 --num_single_layers=0`; with this setting, the GPU memory demand is 60GB with batch size 2 at 512 resolution.

Discussed in #9085.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.