Thanks for adding this! I see that you used a lot of things from transformers. Do you think it is possible to import these (or inherit) from transformers? This would help reduce the maintenance burden. I'm also fine doing it this way, since there are not too many follow-up PRs after a quantizer has been added. About the `HfQuantizer` class: there are a lot of methods that were created to fit the transformers structure. I'm not sure we will need every one of these methods in diffusers. Of course, we can still do a follow-up PR to clean up.
@SunMarc I am guilty as charged but we don't have transformers as a hard dependency for loading models in Diffusers. Pinging @DN6 to seek his opinion.
Update: Chatted with @DN6 as well. We think it's better to redefine it inside diffusers without the transformers-specific bits, which we can clean up in this PR.
@SunMarc I think this PR is ready for another review.
Thanks for adding this, @sayakpaul!
I don't think it makes sense to have this as a separate PR just to add a base class, because it's hard to understand what methods are needed - we should only introduce a minimal base class and gradually add functionality as needed.
Can we have a PR with a minimal working example?
Okay, so, do you want me to add everything needed for the bitsandbytes integration in this PR? But do note that this won't be very different from what we have in transformers.
@sayakpaul
I think so because:
Sometimes we can make a feature branch where a bunch of PRs can be merged before one big honkin' PR is pushed to main at the end, and the pieces are all individually reviewed and can be tested. Is this a viable approach for including quantization?
Okay I will update this branch. @yiyixuxu
cc @MekkCyber for visibility
Just a few considerations for the quantization design.
I would say the initial design should start with loading/inference at just the model level and then proceed to add functionality (pipeline-level loading etc.).

The feature needs to perform the following functions:
- `from_pretrained`
- `from_single_file`
At the moment, the most common ask seems to be the ability to load models into GPU using the FP8 dtype and run inference in a supported dtype by dynamically upcasting the necessary layers. NF4 is another format that's gaining attention.
So perhaps we should focus on this first. This mostly applies to the DiT models, but large models like CogVideo might also benefit from this approach.
Some example quantized versions of models that have been doing the rounds
To cover these initial cases, we can rely on Quanto (FP8) and BitsandBytes (NF4).
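For context, a minimal, hedged sketch (not part of this PR; the toy model and sizes are made up) of what that dynamic upcasting looks like with optimum-quanto today:

import torch
from optimum.quanto import freeze, qfloat8, quantize

# Toy model standing in for a DiT block; quanto stores the weights in float8
# and upcasts them during compute.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU(), torch.nn.Linear(64, 8))
quantize(model, weights=qfloat8)
freeze(model)

out = model(torch.randn(1, 64))
print(out.dtype)  # activations stay in the compute dtype (float32 here)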
Example API:
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, DiffusersQuantoConfig

# Load the model in FP8 with Quanto and perform compute in the configured dtype.
quantization_config = DiffusersQuantoConfig(weights="float8", compute_dtype=torch.bfloat16)
transformer = FluxTransformer2DModel.from_pretrained(
    "<either diffusers format or quanto format weights>", quantization_config=quantization_config
)
pipe = FluxPipeline.from_pretrained("...", transformer=transformer)
The quantization config should probably take the following arguments
DiffusersQuantoConfig(
    weights_dtype="",  # dtype to store weights
    compute_dtype="",  # dtype to perform inference
    skip_quantize_modules=["ResBlock"],
)
I think initially we can rely on the dynamic upcasting operations performed by Quanto and BnB under the hood to start and then expand on them if needed.
Some other considerations:

- Add a `Diffusers` prefix to the quantization configs, e.g. `DiffusersQuantoConfig`, so that when we import quantization configs from transformers there aren't any conflicts.
- Supporting checkpoints saved directly with `safetensors.torch.save_file(model.to(torch.float8_e4m3fn), "model-fp8.safetensors")` and loading full pipeline single file checkpoints. But we can address these later.

This PR will be at the model-level itself. And we should not add multiple backends in a single PR. This PR aims to add `bitsandbytes`. We can do other backends taking this PR as a reference. I would like us to mutually agree on this before I start making progress on this PR.
Concretely, I would like to stick to the outline of the changes laid out in #9174 (along with anything related) for this PR.
The feature needs to perform the following functions
I won't advocate doing all of that in a single PR because it makes things very hard to review. We would rather move faster with something more minimal and confirm its effectiveness.
Allow loading/inference with LoRAs in these quantized models. (This we have to figure out in more detail)
Well, note that if the underlying LoRA wasn't trained with the base quantization precision, it might not perform as expected.
So perhaps we should focus on this first. This mostly applies to the DiT models, but large models like CogVideo might also benefit from this approach.
Please note that `bitsandbytes`-related quantization mostly applies to `nn.Linear`, whereas `quanto` is broader in scope (i.e., `quanto` can be applied to an `nn.Conv2d` as well).
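To make that scope difference concrete, a small, hedged sketch (the helper name is made up) showing that a bnb-style conversion would only pick up `nn.Linear` layers:

import torch.nn as nn

def modules_bnb_would_quantize(model: nn.Module):
    # bnb-style conversion only replaces nn.Linear layers
    return [name for name, module in model.named_modules() if isinstance(module, nn.Linear)]

model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3), nn.Flatten(), nn.Linear(8 * 30 * 30, 16))
print(modules_bnb_would_quantize(model))  # ['2'] - only the Linear; the Conv2d is skipped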
This PR will be at the model-level itself. And we should not add multiple backends in a single PR. This PR aims to add bitsandbytes. We can do other backends taking this PR as a reference. I would like us to mutually agree on this before I start making progress on this PR.
Sounds good to me.
For this PR let's do:
  _supports_gradient_checkpointing = False
  _keys_to_ignore_on_load_unexpected = None
  _no_split_modules = None
+ _keep_in_fp32_modules = []
We have to introduce this attribute now that we're seriously entering the diffusion territory.
If I load a LoRA after quantization, it throws an error:

ValueError: `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct `dtype`.
pipe.load_lora_weights(
    hf_hub_download(repo_name, ckpt_name), adapter_name="ckpt_name"
)
@chuck-ma there's a reason why this PR is still in draft :) We will consider these use cases a bit later in the pipeline.
However, you are welcome to try out basic functionalities like loading and saving without LoRAs.
Gotcha, thanks.
I got this error:
TypeError Traceback (most recent call last)
File ~/autodl-tmp/diffusers/src/diffusers/models/model_loading_utils.py:134, in load_state_dict(checkpoint_file, variant)
133 try:
--> 134 file_extension = os.path.basename(checkpoint_file).split(".")[-1]
135 if file_extension == SAFETENSORS_FILE_EXTENSION:
File ~/miniconda3/lib/python3.10/posixpath.py:142, in basename(p)
141 """Returns the final component of a pathname"""
--> 142 p = os.fspath(p)
143 sep = _get_sep(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
Cell In[7], line 11
4 model_id = "black-forest-labs/FLUX.1-dev"
6 nf4_config = BitsAndBytesConfig(
7 load_in_4bit=True,
8 bnb_4bit_quant_type="nf4",
9 bnb_4bit_compute_dtype=torch.bfloat16,
10 )
---> 11 model_nf4 = FluxTransformer2DModel.from_pretrained(
12 model_id, subfolder="transformer", quantization_config=nf4_config
13 )
14 print(model_nf4.dtype)
15 print(model_nf4.quantization_config)
File ~/miniconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py:114, in validate_hf_hub_args.._inner_fn(*args, **kwargs)
111 if check_use_auth_token:
112 kwargs = smoothly_deprecate_use_auth_token(fn_name=fn.name, has_token=has_token, kwargs=kwargs)
--> 114 return fn(*args, **kwargs)
File ~/autodl-tmp/diffusers/src/diffusers/models/modeling_utils.py:817, in ModelMixin.from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
815 if device_map is None and not is_sharded or (hf_quantizer is not None):
816 param_device = "cpu"
--> 817 state_dict = load_state_dict(model_file, variant=variant)
818 model._convert_deprecated_attention_blocks(state_dict)
819 # move the params from meta device to cpu
File ~/autodl-tmp/diffusers/src/diffusers/models/model_loading_utils.py:146, in load_state_dict(checkpoint_file, variant)
144 except Exception as e:
145 try:
--> 146 with open(checkpoint_file) as f:
147 if f.read().startswith("version"):
148 raise OSError(
149 "You seem to have cloned a repository without having git-lfs installed. Please install "
150 "git-lfs and run git lfs install
followed by git lfs pull
in the folder "
151 "you cloned."
152 )
TypeError: expected str, bytes or os.PathLike object, not NoneType
import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel
model_id = "black-forest-labs/FLUX.1-dev"
nf4_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
model_nf4 = FluxTransformer2DModel.from_pretrained(
model_id, subfolder="transformer", quantization_config=nf4_config
)
print(model_nf4.dtype)
print(model_nf4.quantization_config)
If I don't use the quantization config, everything is just fine:
import torch
from diffusers import FluxTransformer2DModel
model_id = "black-forest-labs/FLUX.1-dev"
model_nf4 = FluxTransformer2DModel.from_pretrained(
model_id, subfolder="transformer",
)
print(model_nf4.dtype)
@chuck-ma do you wanna give this a try now?
Maybe there's something wrong with your setup but I am able to load things without any issues:
https://colab.research.google.com/gist/sayakpaul/1bfcf2441f73364bb06c801f58303cd5/scratchpad.ipynb
I see you also confirmed it's working here.
@chuck-ma do you wanna give this a try now?
Now it works.
  # 4. Give nice warning if unexpected values have been passed
- if len(config_dict) > 0:
+ only_quant_config_remaining = len(config_dict) == 1 and "quantization_config" in config_dict
Because `quantization_config` isn't a part of any model's `__init__()`.
I think it is better to not add it to `config_dict` if it is not going into `__init__`, i.e. at line 511:

# remove private attributes
config_dict = {k: v for k, v in config_dict.items() if not k.startswith("_")}
# remove quantization_config
config_dict = {k: v for k, v in config_dict.items() if k != "quantization_config"}
We cannot remove `quantization_config` from the config of a model, as that would prevent loading of the quantized models via `from_pretrained()`.

`quantization_config` isn't used for initializing a model; it's used to determine what kind of quantization configuration to inject inside the given model. This is why it's only used in `from_pretrained()` of `ModelMixin`.

LMK if you have a better idea to handle it.
We do not remove them from the config, we just don't add them to the `config_dict` inside this `extract_init_dict` method. Basically, the `config_dict` in this function goes through these steps:

1. It is used to build `init_dict`: the quantization config will not go there, so it is not affected if we do not add it to `config_dict`.
2. It is used to warn about attributes that did not make it into `init_dict`: if the quantization config is not there, we do not need to throw a warning for it.
3. It is used to build `unused_kwargs` - so I think this is the only difference it would make. Do we need the quantization config to be in the `unused_kwargs` returned by `extract_init_dict`? I think `unused_kwargs` is only used to send additional warnings for unexpected stuff, but since the quantization config is expected, and we have already decided not to send a warning here inside `extract_init_dict`, I think it does not need to go to the `unused_kwargs` here?

@classmethod
def extract_init_dict(cls, config_dict, **kwargs):
    ...
    config_dict = {k: v for k, v in config_dict.items() if k not in used_defaults and k != "_use_default_values"}

    # remove private attributes
    config_dict = {k: v for k, v in config_dict.items() if not k.startswith("_")}

+   # remove quantization_config
+   config_dict = {k: v for k, v in config_dict.items() if k != "quantization_config"}

    ## here we use config_dict to create `init_dict` which will be passed to `__init__` method
    init_dict = {}
    for key in expected_keys:
        ...
        init_dict[key] = config_dict.pop(key)

-   only_quant_config_remaining = len(config_dict) == 1 and "quantization_config" in config_dict
-   if len(config_dict) > 0 and not only_quant_config_remaining:
+   if len(config_dict) > 0:
        logger.warning(
            f"The config attributes {config_dict} were passed to {cls.__name__}, "
            "but are not expected and will be ignored. Please verify your "
            f"{cls.config_name} configuration file."
        )
    ....

    # 6. Define unused keyword arguments
    unused_kwargs = {**config_dict, **kwargs}

    return init_dict, unused_kwargs, hidden_config_dict
Makes sense. Resolved in 555a5ae.
+     keep_in_fp32_modules=None,
  ) -> List[str]:
-     device = device or torch.device("cpu")
+     device = device or torch.device("cpu") if hf_quantizer is None else device
More on this in the later changes.
else:
    param = param.to(dtype)

is_quant_method_bnb = getattr(model, "quantization_method", None) == QuantizationMethod.BITS_AND_BYTES
if not is_quantized and not is_quant_method_bnb and empty_state_dict[param_name].shape != param.shape:
Because bnb quantized params are usually flattened.
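A quick, hedged illustration of the "flattened" point (the shapes are hypothetical; this is not the PR's code): a 4-bit bnb weight with logical shape (64, 64) is stored packed, two values per byte, so its stored shape can never match the expected one and the usual shape check has to be skipped:

import torch

expected_shape = torch.Size([64, 64])
packed_4bit = torch.empty((64 * 64) // 2, 1, dtype=torch.uint8)  # two 4-bit values per byte

is_quant_method_bnb = True  # what the added check detects via `quantization_method`
if not is_quant_method_bnb and expected_shape != packed_4bit.shape:
    raise ValueError("shape mismatch")  # never reached for bnb-quantized params
print(packed_4bit.shape)  # torch.Size([2048, 1]) vs. the expected torch.Size([64, 64])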
  from ..models import AutoencoderKL
  from ..models.attention_processor import FusedAttnProcessor2_0
  from ..models.modeling_utils import _LOW_CPU_MEM_USAGE_DEFAULT, ModelMixin
+ from ..quantizers.bitsandbytes.utils import _check_bnb_status
The reason I preferred not having `_check_bnb_status()` inline is that I imagine it'd be used across the library, so it didn't make sense to include it inside a function.
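As a rough sketch only (the actual helper's return signature may differ), such a shared utility could look along these lines, reporting whether a module is bnb-quantized and in which mode:

def check_bnb_status_sketch(module):
    # Illustrative only: look at attributes a bnb-quantized model would carry.
    is_loaded_in_4bit = getattr(module, "is_loaded_in_4bit", False)
    is_loaded_in_8bit = getattr(module, "is_loaded_in_8bit", False)
    return is_loaded_in_4bit or is_loaded_in_8bit, is_loaded_in_4bit, is_loaded_in_8bit

class _DummyQuantizedModel:
    is_loaded_in_4bit = True

print(check_bnb_status_sketch(_DummyQuantizedModel()))  # (True, True, False)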
Wow, big PR, great feature being added here.
I haven't done an in-depth review, but took a look at the parts related to PEFT and skimmed the rest.
With such a big change, it might be worth it to control the line coverage of the newly added tests to ensure that the new code is reasonably well covered.
+     keep_in_fp32_modules=None,
  ) -> List[str]:
-     device = device or torch.device("cpu")
+     device = device or torch.device("cpu") if hf_quantizer is None else device
Not specific to this PR, but `device = device or torch.device("cpu")` is a bit dangerous because, theoretically, `0` is a valid device but it would be considered falsy. AFAICT it's not problematic for the existing code, but something to keep in mind.
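A minimal sketch of the pitfall:

import torch

device = 0  # a perfectly valid CUDA device index
resolved = device or torch.device("cpu")
print(resolved)  # cpu - the 0 is treated as falsy and silently dropped

resolved = torch.device("cpu") if device is None else device  # explicit None check instead
print(resolved)  # 0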
Indeed.
I have added a comment about it too.
318 | """ | ||
319 | Converts a quantized model into its dequantized original version. The newly converted model will have some | ||
320 | performance drop compared to the original model before quantization - use it only for specific usecases such as | ||
321 | QLoRA adapters merging. |
Note that PEFT supports merging into bnb weights, so that alone would not require dequantizing the weights entirely.
Noted. I guess not immediately relevant for this PR?
I think it is still interesting to let users have a way to dequantize their models.
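A hedged, CPU-runnable illustration of that point with a toy module (not a bnb model): PEFT's `merge_and_unload()` folds the adapter back into the base weights, and PEFT supports the same call on bnb layers, so a full manual dequantization isn't strictly required just for merging:

import torch
from peft import LoraConfig, get_peft_model

base = torch.nn.Sequential(torch.nn.Linear(16, 16))
peft_model = get_peft_model(base, LoraConfig(target_modules=["0"], r=4))
merged = peft_model.merge_and_unload()

# No LoRA parameters remain; the adapter has been folded into the Linear weight.
print(any("lora" in name for name, _ in merged.named_parameters()))  # False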
Thanks for your efforts. I think it's better if we can load the transformer that has been quantized instead of quantizing the transformer every time we load it. @sayakpaul
Possible now:
from diffusers import FluxTransformer2DModel
model_id = "sayakpaul/flux.1-dev-nf4-pkg"
model_nf4 = FluxTransformer2DModel.from_pretrained(model_id, subfolder="transformer")
Not sure what made you think it's not possible.
import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel
model_id = "black-forest-labs/FLUX.1-dev"
dtype = torch.bfloat16
pipe.transformer.save_pretrained(flux_transformer_id)
model_nf4 = FluxTransformer2DModel.from_pretrained(
flux_transformer_id,
# quantization_config=nf4_config,
torch_dtype=dtype,
)
I just got an error (no matter whether I use quantization_config):
ValueError: Cannot load <class 'diffusers.models.transformers.transformer_flux.FluxTransformer2DModel'> from /root/autodl-tmp/flux_transformer because the following keys are missing:
transformer_blocks.11.ff_context.net.0.proj.weight, transformer_blocks.3.attn.add_k_proj.bias, transformer_blocks.6.ff_context.net.2.weight, single_transformer_blocks.27.attn.to_k.bias, transformer_blocks.15.attn.to_out.0.weight, single_transformer_blocks.35.proj_out.bias, time_text_embed.guidance_embedder.linear_1.bias, transformer_blocks.2.attn.to_add_out.bias, transformer_blocks.3.attn.add_v_proj.weight, transformer_blocks.6.ff.net.2.weight,
@chuck-ma please try to follow the Colab Notebooks provided in https://hf.co/sayakpaul/flux.1-dev-nf4-pkg. All of them show the correct usage and run without any errors. And when you're facing errors, please try to provide Colab Notebooks so I can verify things. Otherwise, it's hard for me to reproduce errors. Could we do that?
from diffusers import FluxPipeline
import torch
from huggingface_hub import hf_hub_download
repo_name = "ByteDance/Hyper-SD"
ckpt_16steps_name = "Hyper-FLUX.1-dev-8steps-lora.safetensors"
create_fuse_checkp = True
nf4_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=dtype,
)
if create_fuse_checkp:
model_nf4 = FluxTransformer2DModel.from_pretrained(
# flux_transformer_id,
# flux_transformer_id,
model_id,
subfolder="transformer",
quantization_config=nf4_config,
torch_dtype=dtype,
)
pipe.transformer = model_nf4
pipe.load_lora_weights(
hf_hub_download(repo_name, ckpt_16steps_name), adapter_name="8_steps_lora"
)
pipe.fuse_lora(lora_scale=0.125)
pipe.transformer.save_pretrained(flux_transformer_id)
If I merge the LoRA and then save the transformer, I get the error. Otherwise, everything is just fine.
I told you here that LoRAs can be tried out later. So, please be aware of the expectations.
OK. Because I saw that your latest code supports merging a LoRA after loading NF4, I wanted to try whether it is possible to save and load after merging the LoRA.
Anyway, nice job.
I think it'll be different if you quantize after merging.
Reviewed half of the PR! I will do the rest soon, but since it is mainly config, there shouldn't be any big blockers. Thanks for the PR @sayakpaul! I left a few comments.
but since it is mainly config, there shouldn't be any big blockers.
Did you mean your comments are associated with configs? Understood if that is the case, but this PR attempts to add full-fledged BnB loading support, not just configs.
Thanks for this huge work @sayakpaul! I'll do one final review after you add the 8-bit tests. This looks very good. I left a few minor comments.
Did you mean your comments are associated with configs? Understood if that is the case, but this PR attempts to add full-fledged BnB loading support, not just configs.
I was talking about the rest of the PR, but yeah, there were also the quantizer + tests. I went through the entire PR now.
I know LoRAs aren't a part of this PR, but what's the current situation with loading LoRAs on top of an NF4 model without dequantizing?
It looks like converting to PEFT would work? I assume there is a way to load on top of the quantized model without having to dequantize it, because Forge doesn't seem to be doing this.
I figure this is the best place to leave a comment, but sorry if it's slightly off topic for the PR.
@itsyourlad no worries. After this PR, the next plan is to add a training script showing how to do LoRAs on NF4 models with PEFT. So, stay tuned.
Thanks for iterating @sayakpaul! LGTM! It's nice to finally have quantization integrated in diffusers!
## When to use what?

This section will be expanded once Diffusers has multiple quantization backends. Currently, we only support `bitsandbytes`. [This resource](https://huggingface.co/docs/transformers/main/en/quantization/overview#when-to-use-what) provides a good overview of the pros and cons of different quantization techniques.
Yes I think it will be nice to also have a table directly in this doc in the future
@yiyixuxu this is ready for your review.
Thanks, this looks really good!
  )
- if pipeline_is_sequentially_offloaded and device and torch.device(device).type == "cuda":
+ pipeline_has_8bit_bnb_quant = any(_check_bnb_status(module)[-1] for _, module in self.components.items())
+ if (
+     not pipeline_has_8bit_bnb_quant
Can you explain why we are adding this `not pipeline_has_8bit_bnb_quant` check here?
  and str(device) in ["cpu"]
  and not silence_dtype_warnings
  and not is_offloaded
+ and not is_loaded_in_4bit_bnb
Why do we add this check here?

If the model is in 4-bit, the dtype should not be `torch.float16` to begin with, no? Even if the dtype shows up as `torch.float16` somehow, I think the warning still holds, i.e. if the weights are in float16, even though we can move it to CPU, we should not.
`bnb` (and many others like `torchao`) only applies to `torch.nn.Linear` layers.

`module.dtype` only reflects the first parameter, because we only check for the first parameter here:

Now, for a model (like Cog) where the first layer has a Conv (patch embedding layer), bnb won't be applicable there, the layer dtype would be `torch.float16`, for example, and `model.dtype` would return `torch.float16`.

if the weights are in float16, even though we can move it to CPU, we should not

So, in the above scenario, not all the weights are in float16; only a tiny fraction is.

This is why I added this check.
But I think your concern is also valid. So, LMK if, for this PR, we should:

1. keep the `not is_loaded_in_4bit_bnb` check and add a note about `dtype`, or
2. update how the model's `dtype` is determined.

I think option 1 should be okay.
OK, I think the scenario you described, where the dtype is float16 but the model contains int8, does not matter here, and we should still send this warning regardless.

However, I'm more concerned about the opposite scenario, where the model contains both float16 and 4-bit/8-bit weights and the dtype shows up as 4-bit/8-bit; in that case we should still send a warning when the user tries to move it to CPU, but we won't do that here based on the current implementation.

The solution should be to update our `get_parameter_dtype` to return a floating-point dtype if one is present, so I think option 2 here?
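A small sketch of what option 2 could look like (hypothetical helper, not the final code): prefer the first floating-point dtype found instead of blindly returning the first parameter's dtype:

import torch

def get_parameter_dtype_sketch(model: torch.nn.Module) -> torch.dtype:
    last_dtype = None
    for param in model.parameters():
        last_dtype = param.dtype
        if param.is_floating_point():
            return param.dtype  # prefer a floating-point dtype if one exists
    return last_dtype  # e.g. a fully-quantized, int-only model

class MixedModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # a packed int8 "weight" first (like a bnb-quantized layer), then a regular fp16 layer
        self.packed = torch.nn.Parameter(torch.zeros(8, dtype=torch.int8), requires_grad=False)
        self.proj = torch.nn.Linear(4, 4).to(torch.float16)

print(get_parameter_dtype_sketch(MixedModel()))  # torch.float16, not torch.int8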
  f"{torch_dtype} needs to be of type `torch.dtype`, e.g. `torch.float16`, but is {type(torch_dtype)}."
  )
- elif torch_dtype is not None:
+ elif torch_dtype is not None and hf_quantizer is None:
Ohhh, I think we cannot do `model.to(torch_dtype)` here now that we are supporting `_keep_in_fp32_modules` - it will just convert all layers to `torch_dtype` again.

I don't think `keep_in_fp32_modules` is supported yet in the code path when `low_cpu_mem_usage=False` - so let's maybe add an error message/warning for that too.
Done.
if (low_cpu_mem_usage is None or not low_cpu_mem_usage) and cls._keep_in_fp32_modules is not None:
low_cpu_mem_usage = True
logger.info("Set `low_cpu_mem_usage` to True as `_keep_in_fp32_modules` is not None.")
Even when `low_cpu_mem_usage` is `True`, we cannot do `model = model.to(torch_dtype)` here anymore. I think we just have to make sure the dtype conversion is handled properly (with `keep_in_fp32_modules`) in each code path under `if low_cpu_mem_usage`.

A dummy example where `_keep_in_fp32_modules` is ignored:
import torch
from diffusers.models.modeling_utils import ModelMixin
from diffusers.configuration_utils import ConfigMixin
class DummyModel(ModelMixin, ConfigMixin):
_keep_in_fp32_modules = ["layer2"]
def __init__(self):
super().__init__()
self.layer1 = torch.nn.Linear(10, 20)
self.layer2 = torch.nn.Linear(20, 30)
self.layer3 = torch.nn.Linear(30, 40)
def forward(self, x):
x = self.layer1(x)
x = self.layer2(x)
x = self.layer3(x)
return x
# Create an instance of the model
model = DummyModel()
model.save_pretrained("dummy_model")
model = DummyModel.from_pretrained("dummy_model", torch_dtype=torch.float16)
I think I have addressed it already. But LMK if you think otherwise.
so this example I provided here https://github.com/huggingface/diffusers/pull/9213/files#r1782037619
import torch
from diffusers.models.modeling_utils import ModelMixin
from diffusers.configuration_utils import ConfigMixin
class DummyModel(ModelMixin, ConfigMixin):
_keep_in_fp32_modules = ["layer2"]
def __init__(self):
super().__init__()
self.layer1 = torch.nn.Linear(10, 20)
self.layer2 = torch.nn.Linear(20, 30)
self.layer3 = torch.nn.Linear(30, 40)
def forward(self, x):
x = self.layer1(x)
x = self.layer2(x)
x = self.layer3(x)
return x
# Create an instance of the model
model = DummyModel()
model.save_pretrained("dummy_model")
model = DummyModel.from_pretrained("dummy_model", torch_dtype=torch.float16)
print(model.layer2.weight.dtype)
It will print out `torch.float16`, even though we have `_keep_in_fp32_modules = ["layer2"]`, so layer2 should be kept in float32, no?
Yes, you're right. 81bb48a works for you?
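For completeness, a minimal sketch (a hypothetical helper, not the commit's actual fix) of casting with `_keep_in_fp32_modules` respected, as opposed to a blanket `model.to(torch_dtype)`:

import torch

def cast_keeping_fp32_modules(model, torch_dtype, keep_in_fp32_modules):
    for name, module in model.named_modules():
        if any(fp32_name in name for fp32_name in keep_in_fp32_modules):
            module.to(torch.float32)
        elif not list(module.children()):  # cast only leaf modules
            module.to(torch_dtype)
    return model

model = torch.nn.ModuleDict({"layer1": torch.nn.Linear(10, 20), "layer2": torch.nn.Linear(20, 30)})
model = cast_keeping_fp32_modules(model, torch.float16, ["layer2"])
print(model["layer1"].weight.dtype, model["layer2"].weight.dtype)  # torch.float16 torch.float32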
        Additional parameters from which to initialize the configuration object.
    """

    _exclude_attributes_at_init = ["_load_in_4bit", "_load_in_8bit", "quant_method"]
What is this?
Cc'ing @SunMarc.
We don't need these attributes when initializing a BnB quantization configuration class, but we need them for subsequent operations.
for k, v in state_dict.items():
    # `startswith` to counter for edge cases where `param_name`
    # substring can be present in multiple places in the `state_dict`
    if param_name + "." in k and k.startswith(param_name):
`k.split('.')[0] == param_name`?
Do you mean `if param_name + "." in k and k.split('.')[0] == param_name:`?

I think `if param_name + "." in k and k.startswith(param_name)` is the same as `k.split('.')[0] == param_name`, because if `k.split('.')[0] == param_name` is True -> `param_name + "." in k` is also True. Is that not the case?
I changed the code with your suggestion and the assertions failed. I didn't dig deeper and I think it's okay to keep it as is because it's mostly a nit, really.
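For the record, one case where the two conditions genuinely differ (and which likely explains the failing assertions) is when `param_name` is itself a dotted path:

param_name = "transformer_blocks.0.attn.to_q"
k = "transformer_blocks.0.attn.to_q.weight"

print(param_name + "." in k and k.startswith(param_name))  # True
print(k.split(".")[0] == param_name)  # False: only "transformer_blocks" is compared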
# Unlike `transformers`, we don't know if we should always keep certain modules in FP32
# in case of diffusion transformer models. For language models and others alike, `lm_head`
# and tied modules are usually kept in FP32.
self.modules_to_not_convert = list(filter(None.__ne__, self.modules_to_not_convert))
Can you provide examples of when this list would contain `None`?
It is configured via `llm_int8_skip_modules` within the `BitsAndBytesConfig` object. It defaults to `None` in our case because, unlike with language models, we don't know if there's a requirement for a default.
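Roughly how the `None` ends up in the list (a sketch of the flow, not the exact quantizer code; the module name is illustrative): the default `llm_int8_skip_modules=None` gets wrapped into a list and later extended, so it has to be filtered out afterwards:

llm_int8_skip_modules = None  # the default in our case
modules_to_not_convert = llm_int8_skip_modules
if not isinstance(modules_to_not_convert, list):
    modules_to_not_convert = [modules_to_not_convert]  # -> [None]
modules_to_not_convert.extend(["proj_out"])  # e.g. modules added later

modules_to_not_convert = list(filter(None.__ne__, modules_to_not_convert))
print(modules_to_not_convert)  # ['proj_out'] - the None is gone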
@yiyixuxu thanks for your reviews. I think they were very nice and helpful. I have gone ahead and re-run the tests on `audace` and everything is green.
I have addressed your comments and made changes. PTAL.
Hi, looks like everything is great. Don't know why approving review is still processing.
  logger.error(f"Provided path ({save_directory}) should be a directory, not a file")
  return

+ hf_quantizer = getattr(self, "hf_quantizer", None)
+ quantization_serializable = (
+     hf_quantizer is not None and isinstance(hf_quantizer, DiffusersQuantizer) and hf_quantizer.is_serializable
+ )
+
+ if hf_quantizer is not None and not quantization_serializable:
+     raise ValueError(
+         f"The model is quantized with {hf_quantizer.quantization_config.quant_method} and is not serializable - check out the warnings from"
+         " the logger on the traceback to understand the reason why the quantized model is not serializable."
+     )
- hf_quantizer = getattr(self, "hf_quantizer", None)
- quantization_serializable = (
-     hf_quantizer is not None and isinstance(hf_quantizer, DiffusersQuantizer) and hf_quantizer.is_serializable
- )
-
- if hf_quantizer is not None and not quantization_serializable:
-     raise ValueError(
-         f"The model is quantized with {hf_quantizer.quantization_config.quant_method} and is not serializable - check out the warnings from"
-         " the logger on the traceback to understand the reason why the quantized model is not serializable."
-     )
+ if hf_quantizer is not None:
+     quantization_serializable = isinstance(hf_quantizer, DiffusersQuantizer) and hf_quantizer.is_serializable
+     if not quantization_serializable:
+         raise ValueError(
+             f"The model is quantized with {hf_quantizer.quantization_config.quant_method} and is not serializable - check out the warnings from"
+             " the logger on the traceback to understand the reason why the quantized model is not serializable."
+         )
Not sure if we can remove `hf_quantizer = getattr(self, "hf_quantizer", None)`. And we need the `hf_quantizer is not None` check because we access properties from it in the error thrown just two lines below.
I've added this into SimpleTuner and it's a bit funky, but it works for training as well, with few modifications other than the loading of the base model and the casting-to-dtype change.
For testing elsewhere, an NF4-trained LyCORIS: https://huggingface.co/RareConcepts/FluxDev-LoKr-beavisandbutthead-nf4
Note that the inference speed with NF4 is noticeably slower once an adapter is thrown on top.
@yiyixuxu ready for another review. Have run the tests too, and they pass.
    )
    and dtype == torch.float16
):
    dtype = torch.float32
Ohh, we should not change `dtype` here, e.g. if it is float16: we change it here because we hit a parameter that we need to upcast, but then it changes the `dtype` for all the remaining parameters in the state dict too; it should remain float16 when it goes to the next loop iteration. We need to just pass float32 to `set_module_tensor_to_device` (if it accepts `dtype`) without changing this variable.
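A small, runnable sketch (simplified, not the PR's exact code) of that suggestion: compute the upcast dtype per parameter and pass it to `set_module_tensor_to_device`, leaving the shared `dtype` variable untouched for the remaining iterations:

import torch
from accelerate.utils import set_module_tensor_to_device

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Linear(4, 4))
state_dict = {k: v.clone() for k, v in model.state_dict().items()}
dtype = torch.float16
keep_in_fp32_modules = ["1"]  # keep the second Linear in fp32

for param_name, param in state_dict.items():
    # local override so `dtype` stays float16 for the remaining parameters
    needs_fp32 = any(m in param_name.split(".") for m in keep_in_fp32_modules)
    param_dtype = torch.float32 if needs_fp32 else dtype
    set_module_tensor_to_device(model, param_name, "cpu", value=param, dtype=param_dtype)

print(model[0].weight.dtype, model[1].weight.dtype)  # torch.float16 torch.float32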
Does ff8ddef work for you?
Very insightful comments, @yiyixuxu! I think I have resolved them all. LMK.
}


class DiffusersAutoQuantizationConfig:
I see this is similar to transformers, but I think the `DiffusersAutoQuantConfig` class is probably not needed.

This is just a simple mapping to a specific quantization config object. The `from_pretrained` method in the AutoQuantizer is just wrapping the AutoConfig `from_pretrained`.

I think we can just move these methods/logic directly into the AutoQuantizer.
I think we can just move these methods/logic directly into the AutoQuantizer.
If this is not a must-have, could do this in a follow-up PR.
Hi folks!
Thanks for working on this. I was able to run the following script on this branch and generate images on my laptop with 8 GB of VRAM:
from diffusers import FluxPipeline, FluxTransformer2DModel
from transformers import T5EncoderModel
import torch
import gc
def flush():
gc.collect()
torch.cuda.empty_cache()
torch.cuda.reset_max_memory_allocated()
torch.cuda.reset_peak_memory_stats()
def bytes_to_giga_bytes(bytes):
return bytes / 1024 / 1024 / 1024
flush()
ckpt_id = "black-forest-labs/FLUX.1-dev"
ckpt_4bit_id = "sayakpaul/flux.1-dev-nf4-pkg"
prompt = "a billboard on highway with 'FLUX under 8' written on it"
text_encoder_2_4bit = T5EncoderModel.from_pretrained(
ckpt_4bit_id,
subfolder="text_encoder_2",
)
pipeline = FluxPipeline.from_pretrained(
ckpt_id,
text_encoder_2=text_encoder_2_4bit,
transformer=None,
vae=None,
torch_dtype=torch.float16,
)
pipeline.enable_model_cpu_offload()
with torch.no_grad():
print("Encoding prompts.")
prompt_embeds, pooled_prompt_embeds, text_ids = pipeline.encode_prompt(
prompt=prompt, prompt_2=None, max_sequence_length=256
)
pipeline = pipeline.to("cpu")
del pipeline
flush()
transformer_4bit = FluxTransformer2DModel.from_pretrained(ckpt_4bit_id, subfolder="transformer")
pipeline = FluxPipeline.from_pretrained(
ckpt_id,
text_encoder=None,
text_encoder_2=None,
tokenizer=None,
tokenizer_2=None,
transformer=transformer_4bit,
torch_dtype=torch.float16,
)
pipeline.enable_model_cpu_offload()
print("Running denoising.")
height, width = 512, 768
images = pipeline(
prompt_embeds=prompt_embeds,
pooled_prompt_embeds=pooled_prompt_embeds,
num_inference_steps=50,
guidance_scale=5.5,
height=height,
width=width,
output_type="pil",
).images
images[0].save("output.png")
let's merge this!
I asked @DN6 to open a follow-up PR for this #9213 (comment),
PR merge contingent on #9720.
@dataclass
class BitsAndBytesConfig(QuantizationConfigMixin):
Something to consider. Let's assume you want to use a quantized transformer model in your code. With this naming, you would always need to set up imports in the following way.
from transformers import BitsAndBytesConfig
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
Not a huge issue. Just giving a heads up in case you want to consider renaming the config to something like `DiffusersBitsAndBytesConfig`.
set_module_kwargs["dtype"] = dtype

# bnb params are flattened.
if not is_quant_method_bnb and empty_state_dict[param_name].shape != param.shape:
In this situation, aren't we skipping parameter shape checks for bnb-loaded weights entirely? What happens when one attempts to load bnb weights but the flattened shape is incorrect?

Perhaps we add a `check_quantized_param_shape` method to the `DiffusersQuantizer` base class. And in the BnBQuantizer we can check if the shape matches the rule here:
https://github.com/bitsandbytes-foundation/bitsandbytes/blob/18e827d666fa2b70a12d539ccedc17aa51b2c97c/bitsandbytes/functional.py#L816
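A rough sketch of what such a method could check for 4-bit bnb weights (the method name follows the suggestion; the packing rule assumed here is two 4-bit values per byte stored as a column vector):

import math
import torch

def check_quantized_param_shape_sketch(expected_shape, loaded_shape) -> bool:
    n_elements = math.prod(expected_shape)
    return tuple(loaded_shape) == (math.ceil(n_elements / 2), 1)

print(check_quantized_param_shape_sketch((64, 64), torch.Size([2048, 1])))  # True
print(check_quantized_param_shape_sketch((64, 64), torch.Size([64, 64])))  # False: not packed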
      f"Cannot load {model_name_or_path_str}because {param_name} expected shape {empty_state_dict[param_name]}, but got {param.shape}. If you want to instead overwrite randomly initialized weights, please make sure to pass both `low_cpu_mem_usage=False` and `ignore_mismatched_sizes=True`. For more information, see also: https://github.com/huggingface/diffusers/issues/1619#issuecomment-1345604389 as an example."
  )

- if accepts_dtype:
-     set_module_tensor_to_device(model, param_name, device, value=param, dtype=dtype)
+ if not is_quantized or (
+     not hf_quantizer.check_quantized_param(model, param, param_name, state_dict, param_device=device)
+ ):
+     if accepts_dtype:
+         set_module_tensor_to_device(model, param_name, device, value=param, **set_module_kwargs)
+     else:
+         set_module_tensor_to_device(model, param_name, device, value=param)
  else:
-     set_module_tensor_to_device(model, param_name, device, value=param)
+     hf_quantizer.create_quantized_param(model, param, param_name, device, state_dict, unexpected_keys)
Small nit. IMO this is a bit more readable
if is_quantized and hf_quantizer.check_quantized_param(
    model, param, param_name, state_dict, param_device=device
):
    hf_quantizer.create_quantized_param(model, param, param_name, device, state_dict, unexpected_keys)
else:
    if accepts_dtype:
        set_module_tensor_to_device(model, param_name, device, value=param, **set_module_kwargs)
    else:
        set_module_tensor_to_device(model, param_name, device, value=param)
134 | """adjust max_memory argument for infer_auto_device_map() if extra memory is needed for quantization""" | ||
135 | return max_memory | ||
136 | |||
137 | def check_quantized_param( |
IMO `check_is_quantized_param` or `check_if_quantized_param` more explicitly conveys what this method does.
class BnB4BitBasicTests(Base4bitTests):
    def setUp(self):
Would clear cache on setup as well.
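A tiny sketch of that suggestion (the class name is illustrative):

import gc
import unittest

import torch

class BnB4BitBasicTestsSketch(unittest.TestCase):
    def setUp(self):
        # free any lingering GPU memory before each test, mirroring tearDown
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()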
It would be useful to rename `llm_int8_skip_modules` or otherwise make it clearer that it is respected in both 4-bit and 8-bit mode, as currently the docs sound like skipped modules are only respected in 8-bit mode while the actual implementation suggests otherwise.
Yeah I think the documentation should reflect this. I guess this is safe to do @SunMarc?
Yeah, we should do that. Would you like to update this @Ednaordinary? We should also do it in transformers when it gets merged.
Sure, @SunMarc. I'll make a PR when I'm able. Should I refactor the parameter name and include a deprecation notice, or just include a note in the docs?
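Until the docs are updated, a hedged example of the behaviour being discussed: `llm_int8_skip_modules` is also honoured when loading in 4-bit, despite the "int8" in its name (the module name below is illustrative):

import torch
from diffusers import BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_skip_modules=["proj_out"],  # kept un-quantized even in 4-bit mode
)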
import torch
from diffusers import FluxFillPipeline,FluxTransformer2DModel
from diffusers.utils import load_image
from transformers import T5EncoderModel
import gc
image = load_image("https://huggingface.co/datasets/diffusers/diffusers-images-docs/resolve/main/cup.png")
mask = load_image("https://huggingface.co/datasets/diffusers/diffusers-images-docs/resolve/main/cup_mask.png")
nf4_model_id = "hf-internal-testing/flux.1-dev-nf4-pkg"
prompt = "a cute dog in paris photoshoot"
def flush():
"""Wipes off memory."""
gc.collect()
torch.cuda.empty_cache()
torch.cuda.reset_max_memory_allocated()
torch.cuda.reset_peak_memory_stats()
def bytes_to_giga_bytes(bytes):
return f"{(bytes / 1024 / 1024 / 1024):.3f}"
flush()
text_encoder_2 = T5EncoderModel.from_pretrained(
nf4_model_id, subfolder="text_encoder_2", torch_dtype=torch.bfloat16
)
transformer = FluxTransformer2DModel.from_pretrained(
nf4_model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = FluxFillPipeline.from_pretrained(
"black-forest-labs/FLUX.1-Fill-dev",
text_encoder_2=text_encoder_2,
transformer=transformer,
torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
image = pipe(
prompt="a white paper cup",
image=image,
mask_image=mask,
height=1632,
width=1232,
guidance_scale=30,
num_inference_steps=50,
max_sequence_length=512,
generator=torch.Generator("cpu").manual_seed(0)
).images[0]
torch.cuda.empty_cache()
memory = bytes_to_giga_bytes(torch.cuda.memory_allocated())
print(f"{memory=} GB.")
image.save(f"flux-fill-dev.png")
But I get this error:
462 output = torch.nn.functional.linear(A, F.dequantize_4bit(B, quant_state).to(A.dtype).t(), bias)
464 # 3. Save state
465 ctx.state = quant_state
RuntimeError: mat1 and mat2 shapes cannot be multiplied (7854x384 and 64x3072)
You are using the wrong checkpoint for fill. It should be https://huggingface.co/diffusers/FLUX.1-Fill-dev-nf4.
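For reference, a hedged sketch of the swap (assuming that repo follows the same layout as the NF4 packages above, i.e. a `transformer` subfolder):

import torch
from diffusers import FluxFillPipeline, FluxTransformer2DModel

transformer = FluxTransformer2DModel.from_pretrained(
    "diffusers/FLUX.1-Fill-dev-nf4", subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = FluxFillPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Fill-dev", transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()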
What does this PR do?

Come back later.

- … (`bitsandbytes`)
- … (`bitsandbytes`)
- … (`bitsandbytes`)
- `from_pretrained()` at the `ModelMixin` level and related changes
- `save_pretrained()`

Notes

- I proposed a `QuantizationLoaderMixin` in #9174, but I realized that is not an approach we can take, because loading and saving a quantized model is very much baked into the arguments of `ModelMixin.save_pretrained()` and `ModelMixin.from_pretrained()`. It is deeply entangled.
- `device_map` support is left for later, because for a pipeline, multiple device_maps can get ugly. This will be dealt with in a follow-up PR by @SunMarc and myself.

No-frills code snippets
Serialization
Serialized checkpoint: https://huggingface.co/sayakpaul/flux.1-dev-nf4-with-bnb-integration.
NF4 checkpoints of Flux transformer and T5: https://huggingface.co/sayakpaul/flux.1-dev-nf4-pkg (has Colab Notebooks, too).
Inference