@haofanwang @wangqixun
would you be willing to give this a review if you have time?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Can we have some sample training results (such as images) from this script attached in the doc, or anywhere suitable?
Here are some training results from the lineart ControlNet.
First trained at 512 resolution and then fine-tuned at 1024 resolution.
> * `report_to="tensorboard"` will ensure the training runs are tracked on TensorBoard.
> * `validation_image`, `validation_prompt`, and `validation_steps` to allow the script to do a few validation inference runs. This allows us to qualitatively check if the training is progressing as expected.
>
> Our experiments were conducted on a single 40GB A100 GPU.
Wow, 40GB A100 seems doable.
Sorry, this is the 80GB A100 (I wrote it wrong). I did a lot of extra work to get it to train with DeepSpeed ZeRO-3 on the 40GB A100, but I don't think this is suitable for everyone.
Not at all. I think it would still be nice to include the changes you had to make in the form of notes in the README. Does that work?
I'll see if I can add it later.
@sayakpaul We added a tutorial on configuring DeepSpeed in the README.
There are some tricks to lower GPU memory usage:
With 1, 2, and 3, can this be trained under 40GB?
In my experience, DeepSpeed ZeRO-3 must be used. @linjiapro, your settings will cost about 70GB at 1024 resolution with batch size 1, or at 512 with batch size 3.
Sorry to bother you, but have you ever tried caching the text-encoder and VAE latents to train with less GPU memory? @PromeAIpro @linjiapro
Text-encoder caching is already available in this script (saving about 10GB of GPU memory on T5). For caching the VAE latents, you can check the DeepSpeed section in the README, which includes VAE caching.
FYI, you can also reduce memory usage by using `optimum-quanto` and qint8-quantising all of the modules except the ControlNet (not activation quantisation, just the weights). I ran some experiments on this with my own ControlNet training script and it seems to work just fine.
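For reference, a minimal sketch of that weight-only quantisation, assuming the frozen modules are already loaded under names like `flux_transformer`, `vae`, `text_encoder_one`, and `text_encoder_two` (these names are illustrative, not the script's actual variables):

```python
from optimum.quanto import freeze, qint8, quantize

# Weight-only qint8 quantisation: weights are stored in int8 and de-quantised
# on the fly; activations are left untouched. The ControlNet itself is skipped
# so it keeps training in full precision.
for module in (flux_transformer, vae, text_encoder_one, text_encoder_two):
    quantize(module, weights=qint8)
    freeze(module)
```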
Hi, thanks for your PR. I just left some initial comments. LMK what you think.
Thanks! Appreciate your hard work here. Left some more comments.
Can we fix the code quality issues? `make quality && make style`?
Thank you! Left some more comments. Let me know if they make sense or are unclear.
Left some additional minor comments but I see existing comments are yet to be addressed. Let me know when you would like another round of review.
@sayakpaul hey, I think I have fixed all the issues, time to start a new review.
```python
bsz = pixel_latents.shape[0]
noise = torch.randn_like(pixel_latents).to(accelerator.device).to(dtype=weight_dtype)
# Sample a random timestep for each image
# for weighting schemes where we sample timesteps non-uniformly
u = compute_density_for_timestep_sampling(
    weighting_scheme=args.weighting_scheme,
    batch_size=bsz,
    logit_mean=args.logit_mean,
    logit_std=args.logit_std,
    mode_scale=args.mode_scale,
)
indices = (u * noise_scheduler_copy.config.num_train_timesteps).long()
timesteps = noise_scheduler_copy.timesteps[indices].to(device=pixel_latents.device)

# Add noise according to flow matching.
sigmas = get_sigmas(timesteps, n_dim=pixel_latents.ndim, dtype=pixel_latents.dtype)
noisy_model_input = (1.0 - sigmas) * pixel_latents + sigmas * noise
```
I thought we were using a different timestep sampling procedure and I suggested to have that as a default. Are we not doing that anymore?
Do you mean setting the original sampling scheme as the default? For the weighting scheme I just copied from here.
Yeah, I meant to keep the sigmoid sampling as your default and let users configure it as we do in the other scripts.
Okay. But it depends on a std and mean. IIRC your scheme did `torch.randn()` and applied sigmoid, right?
Yes, this used `torch.randn()` at first, but given the examples you provided, I think this is maybe a better solution for us?
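For context, a rough sketch of the two timestep-sampling schemes being compared (variable names are illustrative; the second branch mirrors what the `logit_normal` weighting scheme of `compute_density_for_timestep_sampling` does):

```python
import torch

bsz, num_train_timesteps = 4, 1000

# (a) the original scheme in this script: sigmoid of a standard normal
u_sigmoid = torch.sigmoid(torch.randn(bsz))

# (b) the "logit_normal" weighting scheme: the same idea, but with a
#     configurable logit_mean / logit_std
logit_mean, logit_std = 0.0, 1.0
u_logit_normal = torch.sigmoid(torch.randn(bsz) * logit_std + logit_mean)

# either way, u in (0, 1) is then mapped to discrete training timesteps
indices = (u_logit_normal * num_train_timesteps).long()
```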
Left some comments, but here are my concerns:
LMK if anything is unclear.
@PromeAIpro we didn't have to close this PR. Is there anything we could do to revive it? We would very much like to do that. Please let us know.
Sorry, I did it by mistake.
Thanks. I think this is looking good. Some minor comments.
Also, we would need to add tests like in https://github.com/huggingface/diffusers/blob/main/examples/controlnet/test_controlnet.py.
@yiyixuxu could you review the changes made to the ControlNet pipeline?
Added a test in `test_controlnet.py`.
@yiyixuxu could you review the changes made to the ControlNet Flux pipeline once you have a moment?
@PromeAIpro Hi, great work! Can this also train on FLUX Schnell, or only dev right now?
Training on Schnell seems to work but I had to set `guidance=None` during the forward pass.
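A minimal sketch of that guidance handling inside the training loop, assuming variables like `flux_transformer`, `bsz`, `accelerator`, `weight_dtype`, and an `args.guidance_scale` flag (these names are assumptions, not necessarily the script's):

```python
import torch

# FLUX.1-dev is guidance-distilled and expects a guidance tensor;
# FLUX.1-schnell (and tiny test checkpoints) have no guidance embedding,
# so we pass guidance=None during the forward pass.
if getattr(flux_transformer.config, "guidance_embeds", False):
    guidance = torch.full(
        (bsz,), args.guidance_scale, device=accelerator.device, dtype=weight_dtype
    )
else:
    guidance = None
```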
Excellent job, but I have a question. I tried the script with 512 resolution, bf16, and batch size 1, and it uses 76GB of memory on an A800 (80GB). And 1024 resolution cannot be trained because of the memory. Any suggestions?
with the following settings:

accelerate config:

```yaml
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: '1'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

training script:

```bash
accelerate launch --main_process_port 29511 --config_file acc_config_singlegpu.yaml train_controlnet_flux.py \
  --pretrained_model_name_or_path="/home/export/base/ycsc_yaosy/yaosy/online1/models/black-forest-labs/FLUX.1-dev" \
  --jsonl_for_train="./controlnet_sdxl_train_5examples.jsonl" \
  --conditioning_image_column=conditioning_image \
  --image_column=image \
  --caption_column=text \
  --output_dir="./controlnet_example_512" \
  --mixed_precision="bf16" \
  --resolution=512 \
  --learning_rate=1e-5 \
  --max_train_steps=15000 \
  --validation_steps=5 \
  --checkpointing_steps=200 \
  --validation_image "test.jpg" \
  --validation_prompt "..." \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --report_to="tensorboard" \
  --num_double_layers=4 \
  --num_single_layers=0 \
  --seed=42
```
@ShunyuYao try using `--use_adafactor` as the optimizer, maybe? Also, with the latest code you can use `--enable_model_cpu_offload` to run at 1024 resolution with AdamW.
Here are my settings (they use about 66GB for training). Please delete the `#` comments when you use them:
```bash
CUDA_VISIBLE_DEVICES=0 python ../train_controlnet_flux.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --dataset_name=fusing/fill50k \
  --max_train_samples=100 \
  --conditioning_image_column=conditioning_image \
  --image_column=image \
  --caption_column=text \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="bf16" \
  --resolution=1024 \
  --learning_rate=1e-5 \
  --max_train_steps=10 \
  --checkpointing_steps=11 \
  --validation_steps=1 \
  --validation_image "./conditioning_image_1.png" \
  --validation_prompt "red circle with blue background" \
  --num_validation_images=1 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=2 \
  --report_to="wandb" \
  --num_double_layers=4 \
  --num_single_layers=0 \
  --seed=42 \
  --save_weight_dtype="bf16" \
  --push_to_hub \
  --enable_model_cpu_offload \ # will cause slower training
  --use_adafactor \ # save 10g memory
```
@ShunyuYao I would try to precompute the text embeddings (and maybe the VAE outputs too) if possible. Those will save you a few gigabytes.
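As an illustration only, here is one way such precomputation could look; the model path, max lengths, and helper name below are assumptions rather than options the script exposes:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5TokenizerFast


@torch.no_grad()
def precompute_prompt_embeds(prompts, model_path="black-forest-labs/FLUX.1-dev", device="cuda"):
    # Load both Flux text encoders once, encode every caption, then free them.
    tokenizer_one = CLIPTokenizer.from_pretrained(model_path, subfolder="tokenizer")
    text_encoder_one = CLIPTextModel.from_pretrained(model_path, subfolder="text_encoder").to(device)
    tokenizer_two = T5TokenizerFast.from_pretrained(model_path, subfolder="tokenizer_2")
    text_encoder_two = T5EncoderModel.from_pretrained(model_path, subfolder="text_encoder_2").to(device)

    cache = []
    for prompt in prompts:
        clip_ids = tokenizer_one(
            prompt, padding="max_length", max_length=77, truncation=True, return_tensors="pt"
        ).input_ids.to(device)
        pooled = text_encoder_one(clip_ids).pooler_output.cpu()

        t5_ids = tokenizer_two(
            prompt, padding="max_length", max_length=512, truncation=True, return_tensors="pt"
        ).input_ids.to(device)
        prompt_embeds = text_encoder_two(t5_ids)[0].cpu()

        cache.append({"prompt_embeds": prompt_embeds, "pooled_prompt_embeds": pooled})

    # Free the text encoders so the memory is available for training.
    del text_encoder_one, text_encoder_two
    torch.cuda.empty_cache()
    return cache
```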
```diff
             joint_attention_kwargs=self.joint_attention_kwargs,
             return_dict=False,
         )
+        # ensure dtype
```
why is this needed?
See the discussion in #9324 (comment) and https://github.com/huggingface/diffusers/pull/9324/files/32eb1ef4897332954f3f0e967ff165e09e341ed8#r1758447457.
We think that, rather than writing the conversion code in the training script, it is better to write it in the pipeline (right now it is written in both the training script and the pipeline). It is just a safeguard and has no effect on inference.
I looked at the comment; it still doesn't explain why this is needed.
We have no issue running inference with the available ControlNet checkpoint without this change.
That's right; the dtype conversion change in the pipeline is not closely related to the ControlNet training script. We found the dtype inconsistency issue while writing the training script. It doesn't happen during inference now, but we fixed it as a by-the-way. Maybe we should move this fix to a new issue, to be applied if a dtype inconsistency shows up in the future?
We are neutral on that; what do you think? @yiyixuxu @sayakpaul
Yes, a separate issue would be nice! And maybe a minimal reproducible script to help understand the issue.
This is because T5 does not support autocast (causing black images). However, during validation our ControlNet is fp32 and our transformer is bf16, so we need to explicitly convert the dtype in the pipeline.
Started a new issue: #9527.
But validation is only for logging outputs; why can't we run the ControlNet in bf16 too? Anyway, I think this change should not be in the pipelines for now :)
Yes. Is there a way for diffusers to clone the ControlNet? We are considering cloning a copy and converting it to bf16 for validation; if we directly convert the original weights, we will lose precision.
The fundamental solution is to fix the T5 autocast problem here (#9527).
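A minimal sketch of the cloning idea described above, assuming the training loop holds the fp32 ControlNet as `flux_controlnet` under an `accelerator` (a sketch of the idea, not the implementation that landed):

```python
import copy

import torch

# Keep the trained ControlNet in fp32; validate with a temporary bf16 copy
# so the original weights never lose precision.
controlnet_for_validation = copy.deepcopy(
    accelerator.unwrap_model(flux_controlnet)
).to(dtype=torch.bfloat16)

# ... run the validation pipeline with `controlnet_for_validation` ...

del controlnet_for_validation
torch.cuda.empty_cache()
```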
Okay something that would work is the following:
Would this work?
That works; the pipeline change is now removed.
Hi,
Just shamelessly plugging my ControlNet repo here which I just made public: https://github.com/christopher-beckham/flux-controlnet
Feel free to pick and choose things from the code if you think it could help with your PR. I have explained some of it in the README. While there is no public dataset associated with this repo, I have trained with qint8 quantisation + 8-bit ADAM on a fairly large internal dataset and gotten more or less decent images on a 40GB GPU.
Some of the tricks mentioned here may also be of use: https://github.com/bghira/SimpleTuner/blob/main/documentation/quickstart/FLUX.md
@christopher-beckham thanks for sharing your work! Looks very cool!
The purpose of the scripts within `examples` (at least the ones we officially maintain at the moment) is to provide bare-bones references. So, I think it's okay for the moment to skip the quantization-related bits and other things.
The simplest reasonable defaults that lead to okay results are fine, IMO. So, what we could do is mention other popular ControlNet trainers like yours in the README in case users want to take things further. I hope that works.
Just reviewed it!
I think it looks quite good, apart from @yiyixuxu's concerns here: #9324 (comment).
I would probably lean towards doing it from the training script because otherwise, it would add more maintenance. But I will let Yiyi comment further.
@PromeAIpro could you please follow the instructions from the CI and ensure the core quality checks pass?
Yes, I know. This is because T5 does not support autocast (causing black images). However, during validation our ControlNet is fp32 and our transformer is bf16, so we need to explicitly convert the dtype in the pipeline.
@sayakpaul I already ran `make style` and `make quality`. Is there anything else that needs to be done?
@sayakpaul not quite sure about this.
Can you try the following? Navigate to your local clone of `diffusers`, run `pip install -e ".[quality]"`, and then `make style && make quality`?

@sayakpaul That works! Please check it again.
@PromeAIpro Thanks for your advice, I tried different settings. I finally found that `--gradient_checkpointing` is quite useful for saving memory at 1024 resolution.
@vahidEttehadiAniml Sorry, I haven't tested multi-GPU much, but you can definitely follow the process I give in the README and train on a 40GB A100 with DeepSpeed and Accelerate.
Thank you so much for your hard work!
👀
I think one key factor to reduce the GPU load is the following:
--quantize: quantise everything (except ControlNet) into int8 via the optimum-quanto library. This is weight only quantisation, so params are stored in int8 and are de-quantised on the fly. You may be able to squeeze out even more savings with lower bits but this has not been tested.
Can we add the above option to this PR?
cc @PromeAIpro
@linjiapro: @sayakpaul envisioned the script as being more on the bare-bones side (see his reply above), though I would also argue that most people are not going to have access to (or not going to want to pay for) an 80GB GPU. Therefore, I would really argue for the addition of quantisation.
Edit: edited my original post, I originally mentioned 8-bit ADAM would be nice to have but I see that it's in :)
Edit no. 2: from the above discussion it looks like the ControlNet is being trained in fp32; however, it would be trivial to add an option to also train it in bf16, and I had no issues with it. And maybe you'd avoid the autocast issue altogether for the validation logging.
Oh, I missed sayakpaul's comments. I was hoping that quantisation would not change too much of the script; it is just a data format for the nets. I thought it would be a flag that, when turned on, casts the nets to int8. But it seems it is not that simple.
It is that simple, as that's what `optimum-quanto` is designed to do. As to what extent one loses out on sample quality during training, I'm not sure (one just quantises the entire backbone; you can keep the ControlNet in bf16). But in my own personal experience using it (with my repo) I never encountered any numerical instabilities, and sample quality was on par with what I expected from other ControlNets.
> Edit no. 2: from the above discussion it looks like the ControlNet is being trained in fp32; however, it would be trivial to add an option to also train it in bf16, and I had no issues with it. And maybe you'd avoid the autocast issue altogether for the validation logging.

@christopher-beckham thank you! WDYT about a follow-up PR to:

- Enable training and saving in BF16
- Add your repository in the README so that people can explore other ways

Would that work for you?
@PromeAIpro the test is failing:
https://github.com/huggingface/diffusers/actions/runs/11056617305/job/30718614922?pr=9324#step:9:356
We need to use a small checkpoint like `hf-internal-testing/tiny-flux-pipe`.
Ah, I see what is happening. First, we are using a large model (see https://github.com/huggingface/diffusers/actions/runs/11063243172/job/30739077215?pr=9324#step:9:268), which is too big for the CI. Can we please follow what the rest of the ControlNet tests do, i.e., use a tiny checkpoint?
Regarding the tokenizer, we still need to address the usage of small checkpoints.
BTW, how can I run this test, `test_controlnet_flux`?

`pytest examples/controlnet -k "test_controlnet_flux"`
But you're using "--controlnet_model_name_or_path=promeai/FLUX.1-controlnet-lineart-promeai" in the test.
We don't use a pre-trained ControlNet model in the tests. We initialize it from the denoiser. For SD and SDXL, we initialize it from the UNet. We need to do something similar here.
I tried using

```python
flux_controlnet = FluxControlNetModel.from_transformer(
    flux_transformer,
    num_layers=args.num_double_layers,
    num_single_layers=args.num_single_layers,
)
```

but got an error.
BTW, the tokenizer loading problem was fixed by removing `use_fast=False`:

```diff
 tokenizer_two = AutoTokenizer.from_pretrained(
     args.pretrained_model_name_or_path,
     subfolder="tokenizer_2",
     revision=args.revision,
-    use_fast=False,
 )
```
Thanks for fixing the tokenizer issue. Regarding initializing from the transformer, I think it's because we're using:

```
--num_double_layers=4
--num_single_layers=0
```

Could we try:

```
--num_double_layers=2
--num_single_layers=1
```
I explicitly pass them in, and it works:

```diff
 flux_controlnet = FluxControlNetModel.from_transformer(
     flux_transformer,
+    attention_head_dim=flux_transformer.config["attention_head_dim"],
+    num_attention_heads=flux_transformer.config["num_attention_heads"],
     num_layers=args.num_double_layers,
     num_single_layers=args.num_single_layers,
 )
```
I can replicate the error:
```python
from diffusers import FluxTransformer2DModel, FluxControlNetModel

transformer = FluxTransformer2DModel.from_pretrained(
    "hf-internal-testing/tiny-flux-pipe", subfolder="transformer"
)
controlnet = FluxControlNetModel.from_transformer(
    transformer=transformer, num_layers=1, num_single_layers=1, attention_head_dim=16, num_attention_heads=1
)
```
Leads to:
```
RuntimeError: Error(s) in loading state_dict for CombinedTimestepTextProjEmbeddings:
    size mismatch for timestep_embedder.linear_1.weight: copying a param with shape torch.Size([32, 256]) from checkpoint, the shape in current model is torch.Size([16, 256]).
    size mismatch for timestep_embedder.linear_1.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
    size mismatch for timestep_embedder.linear_2.weight: copying a param with shape torch.Size([32, 32]) from checkpoint, the shape in current model is torch.Size([16, 16]).
    size mismatch for timestep_embedder.linear_2.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
    size mismatch for text_embedder.linear_1.weight: copying a param with shape torch.Size([32, 32]) from checkpoint, the shape in current model is torch.Size([16, 32]).
    size mismatch for text_embedder.linear_1.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
    size mismatch for text_embedder.linear_2.weight: copying a param with shape torch.Size([32, 32]) from checkpoint, the shape in current model is torch.Size([16, 16]).
    size mismatch for text_embedder.linear_2.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
```
Opened an issue here: #9540.
@PromeAIpro could you make the changes accordingly then?
I have tested it on my own machine and it works correctly.
BTW, I added guidance handling for Flux transformers that don't use guidance, such as tiny-flux-pipe.
Looks good, please try it again!
```
$ pytest examples/controlnet -k "test_controlnet_flux"
===================================================================== test session starts ======================================================================
platform linux -- Python 3.10.14, pytest-8.3.3, pluggy-1.5.0
rootdir: /data3/home/srchen/test_diffusers/diffusers
configfile: pyproject.toml
collected 5 items / 4 deselected / 1 selected

examples/controlnet/test_controlnet.py .                                                                                                                 [100%]

=============================================================== 1 passed, 4 deselected in 25.87s ===============================================================
```
Thanks a lot for your contributions!
Thank you for your guidance in my work!!
> @christopher-beckham thank you! WDYT about a follow-up PR to:
>
> - Enable training and saving in BF16
> - Add your repository in the README so that people can explore other ways
>
> Would that work for you?
Thank you guys for your work! @sayakpaul does this reply indicate that BF16 is not currently supported? I saw in a slightly earlier comment that the example parameters provided by @PromeAIpro included `--mixed_precision="bf16"` and `--save_weight_dtype="bf16"`; what do they mean?

Also, I understand that your design idea is to provide only simple and effective basic functionality, but I also found in the SDXL ControlNet training script that there are some optimisation options such as `--gradient_checkpointing`, `--use_8bit_adam`, `--set_grads_to_none`, `--enable_xformers_memory_efficient_attention`, etc. Will similar performance optimisation options appear in this script later?

Thank you very much for your answers!
Here are some training results from the lineart ControlNet (images omitted):

| input | output | prompt |
| --- | --- | --- |
| (image) | (image) | cute anime girl with massive fluffy fennec ears and a big fluffy tail blonde messy long hair blue eyes wearing a maid outfit with a long black gold leaf pattern dress and a white apron mouth open holding a fancy black forest cake with candles on top in the kitchen of an old dark Victorian mansion lit by candlelight with a bright window to the foggy forest and very expensive stuff everywhere |
| (image) | (image) | a busy urban intersection during daytime. The sky is partly cloudy with a mix of blue and white clouds. There are multiple traffic lights, and vehicles are seen waiting at the red signals. Several businesses and shops are visible on the side, with signboards and advertits. The road is wide, and there are pedestrian crossings. Overall, it appears to be a typical day in a bustling city. |

First trained at 512 resolution and then fine-tuned at 1024 resolution.
Hello, where can I find the dataset for training the ControlNet? Thanks.
What does this PR do?
In this PR we add a Flux ControlNet training script to `examples`, tested on an A100-SXM4-80GB.

Using this training script, we can customize the number of transformer layers by setting `--num_double_layers=4 --num_single_layers=0`; with this setting, the GPU memory demand is 60GB with batch size 2 at 512 resolution.

Discussed in #9085.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.