It is a work in progress; I am not satisfied with the results (maybe I am doing something wrong).
Mask preprocessing is done outside of the PR. I extract masks from an RGB image by selecting its unique colors and discarding the background (black). Here is a code snippet to get the list of masks from the following image:
import torch
import numpy as np
from diffusers import AutoPipelineForText2Image, DDIMScheduler
from diffusers.utils import load_image
noise_scheduler = DDIMScheduler(
num_train_timesteps=1000,
beta_start=0.00085,
beta_end=0.012,
beta_schedule="scaled_linear",
clip_sample=False,
set_alpha_to_one=False,
steps_offset=1
)
pipeline = AutoPipelineForText2Image.from_pretrained(
"SG161222/Realistic_Vision_V4.0_noVAE",
torch_dtype=torch.float16,
scheduler=noise_scheduler,
feature_extractor=None,
safety_checker=None
).to("cuda")
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter-full-face_sd15.bin")
pipeline.set_ip_adapter_scale(0.7)
# Load image
mask = load_image("./mask.png")
# Use image processor registered in the pipeline
iproc = pipeline.image_processor
mask = iproc.pil_to_numpy(mask)[0]
# Find unique colors
colors = np.unique(mask.reshape(-1, 3), axis=0)
# Discard the background (pure black)
unique = [colors[i] for i in range(colors.shape[0]) if np.any(colors[i] != 0)]
# Build one binary mask per remaining color
masks = [np.expand_dims(np.all(mask == u, axis=-1).astype(np.float32), axis=0) for u in unique]
masks = [iproc.numpy_to_pt(mask)[0] for mask in masks]
Input images are:
https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ai_face2.png
https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/women_input.png
Then I called the pipeline as follows:
generator = torch.Generator(device="cpu").manual_seed(33)
num_images=1
images = pipeline(
prompt="A photo of two girls wearing black dresses, holding red roses in hand, upper body, behind is the Eiffel Tower",
ip_adapter_image=[[image1, image2]],
negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
num_inference_steps=20, num_images_per_prompt=num_images, width=704, height=512,
generator=generator, cross_attention_kwargs={"masks": masks},
#output_type= "np"
).images
Nice work, you're doing the one use case that I didn't code: IP-Adapters with multiple images and multiple masks. It's the same as two IP-Adapters with one image and one mask each, with the added benefit that you can manage the weight of each one separately. In my tests it would be like this:
Result 1 | Result 2 (comparison images)
I use SDXL only, but they should be comparable. I really recommend that you don't use multiple masks for multiple images and instead use one mask per IP-Adapter. I haven't seen someone using this, but I could be wrong.
The problem you see in your example is more noticeable with SDXL:
Result 1 | Result 2 (comparison images)
What's happening is that you're matching the batch with the masks, but the batch is doubled or not depending on classifier-free guidance. So what you're really doing is applying only one mask if the negative prompt is empty, or dropping one if the CFG scale is less than 1. Also, you're applying the mask to the ip_hidden_states of multiple images, so you can also see that the faces are combined into one where the mask is applied.
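To make the batching point concrete, here is a minimal sketch (variable names are illustrative, not the PR's actual code) of why each mask has to cover the CFG-doubled batch rather than being zipped against raw batch entries:

import torch

num_images = 1
do_classifier_free_guidance = True  # non-empty negative prompt / guidance_scale > 1
# with CFG the hidden states carry the unconditional and the conditional samples,
# so the effective batch is doubled
batch_size = num_images * (2 if do_classifier_free_guidance else 1)

mask = torch.ones(1, 512, 704)  # one mask per generated image
if mask.shape[0] < batch_size:
    # repeat the mask over the doubled batch so the pairing stays correct
    mask = mask.repeat(batch_size, 1, 1)

print(mask.shape)  # torch.Size([2, 512, 704])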
There are a few more minor issues, but I'll wait and see which approach you use.
Hi @asomoza, thanks for the suggestion. I updated the for loop and now the results look pretty good.
Also you're applying the mask to the ip_hidden_states of multiple images, so you can also see that the faces are combined into one where the mask is applied.
I am not sure what you mean here. The image after the mask in the first comment is the result of generation without applying masks, so it is correct to have a combination of the two faces.
I changed the base SD model and loaded two IP-Adapters to the pipeline:
pipeline = AutoPipelineForText2Image.from_pretrained(
"frankjoshua/realisticVisionV51_v51VAE",
torch_dtype=torch.float16,
scheduler=noise_scheduler,
feature_extractor=None,
safety_checker=None
).to("cuda")
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name=["ip-adapter-plus-face_sd15.bin", "ip-adapter-full-face_sd15.bin"])
pipeline.set_ip_adapter_scale([0.7, 0.7])
generator = torch.Generator(device="cpu").manual_seed(33)
num_images=4
images = pipeline(
prompt="A photo of two girls wearing black dresses, holding red roses in hand, upper body, behind is the Eiffel Tower",
ip_adapter_image=[[image1], [image2]],
negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
num_inference_steps=20, num_images_per_prompt=num_images, width=704, height=512,
generator=generator, cross_attention_kwargs={"masks": masks}
).images
Yeah, it's working OK now, nice work.
I am not sure about what you mean here. The image after the mask in the first comment is the result of generation without applying masks, so it is correct to have a combination of the two faces.
I meant the one after that, where there was supposed to be one face for each woman. You can also see it in my results; that's because there were multiple images for one IP-Adapter and you were applying one mask to those.
You don't have that problem now and it doesn't matter anymore since you're using two IP Adapters, but the equivalent would be if you do this:
images = pipeline(
prompt="A photo of two girls wearing black dresses, holding red roses in hand, upper body, behind is the Eiffel Tower",
ip_adapter_image=[[image1, image2], [image2]],
negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
num_inference_steps=20, num_images_per_prompt=num_images, width=704, height=512,
generator=generator, cross_attention_kwargs={"masks": masks}
).images
The results are like this:
[[image1], [image2]] | [[image1, image2], [image2]] (comparison images)
I know they're similar, but I can see the difference instantly since I've done a million tests with IP-Adapters.
# the output of sdp = (batch, num_heads, seq_len, head_dim)
# TODO: add support for attn.scale when we move to Torch 2.1
current_ip_hidden_states = F.scaled_dot_product_attention(
    query, ip_key, ip_value, attn_mask=None, dropout_p=0.0, is_causal=False
)
mask_downsample = mask_downsample.to(query.dtype).to(current_ip_hidden_states.device)

current_ip_hidden_states = current_ip_hidden_states.transpose(1, 2).reshape(
    batch_size, -1, attn.heads * head_dim
)
current_ip_hidden_states = current_ip_hidden_states.to(query.dtype)
current_ip_hidden_states = current_ip_hidden_states * mask_downsample
This throws an error if you use a mask with a different width and height from the generated image. For example, if I use your mask with SDXL and generate a 1024x1024 image I get this error:
The size of tensor a (4096) must match the size of tensor b (4070) at non-singleton dimension 1
I know, I didn't add checks on mask size yet. I think the ComfyUI implementation has the same issue as well, but I haven't tested it:
https://github.com/cubiq/ComfyUI_IPAdapter_plus/blob/90d3451cd970d5aa9cac55224e24a7c7fd98d253/IPAdapterPlus.py#L537
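As a side note, a quick sketch of where the 4070 in that error comes from, assuming the 704x512 (width x height) mask from the first snippet and an SDXL attention layer with 4096 queries; the rounding below mirrors the logic in the attention processor:

import math

seq_len = 4096      # queries at a 64x64 attention resolution
ratio = 704 / 512   # mask width / height = 1.375

mask_h = int(math.sqrt(seq_len / ratio))        # 54
mask_h = mask_h + int((seq_len % mask_h) != 0)  # 55
mask_w = seq_len // mask_h                      # 74

print(mask_h * mask_w)  # 4070, which cannot broadcast against 4096 tokens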
I think it works with masks that aren't the same ratio as the generation, it's just not recommended. Maybe @cubiq can provide his insights here. I use the same code and it doesn't use the ratio; I think the "need checking" note means that he wasn't completely sure of the formula used.
Ok, I will check without ratio as in the other implementation! Thanks
The attention mask is resized and stretched at each iteration; the aspect ratio doesn't matter, but of course it's better if you provide the right size.
Due to rounding errors it might happen that you get the wrong size, but it's not very common and I think I have a solution for that already.
I can confirm the issue is still there also with the other implementation
In that case I really don't know what the best method of doing this would be that's consistent with diffusers.
In my case I prepare the mask latents outside the attention processor with the VAE scale factor and the width and height of the generated image, but it could be as simple as throwing an error telling the user that the masks must have the same aspect ratio as the generated image.
Great work! Thanks everyone here ❤️ the results look super cool to me!
Can we confirm that it works correctly as long as we only pass one image and one mask for each ip-adapter? @asomoza @fabiorigano
so the remaining item is:
Yes, it works correctly, but with one or multiple prompt images and one mask per IP-Adapter, which IMO is the correct implementation.
There's one other issue that maybe should be addressed, though I don't know if it comes from this PR or from before: if you don't pass the same number of scales, it completely ignores the IP-Adapters that don't have a scale, without showing a message or error.
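A hypothetical repro of that scenario, reusing the pipeline from the snippets above (this only illustrates the behavior described in the comment, it is not a statement about the final API):

# two IP-Adapters loaded, but only one scale passed: the adapter without a
# scale is reportedly ignored silently instead of raising an error
pipeline.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder="models",
    weight_name=["ip-adapter-plus-face_sd15.bin", "ip-adapter-full-face_sd15.bin"],
)
pipeline.set_ip_adapter_scale([0.7])  # only one scale for two adapters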
super cool!
if len(masks) != len(ip_hidden_states):
    raise ValueError(
        f"Number of masks ({len(masks)}) must match number of IP-Adapters ({len(self.scale)})"
    )
From what I understand, it only works when we pass 1 image / 1 mask / 1 IP-Adapter? If so, let's check the number of images here and throw an error if multiple images are passed:
if ip_hidden_states[0].shape[1] > 1:
    raise ValueError("....")
Why do you think that? If you perform that check, you will remove all the instant lora functionality.
It's only when mask is not None though, you can still use multiple images without a mask. And it's only based on the understanding that we can only use one image / one mask / one IP-Adapter when we use a mask, no?
if it works with multiple images for sure we don't need this!
it works with multiple images, I tested it, so the only check should be that the number of masks matches the number of ip adapters.
Are we covering these cases in the tests?
current_ip_hidden_states = current_ip_hidden_states.to(query.dtype)

if mask is not None:
    seq_len = current_ip_hidden_states.shape[1]
    o_h = masks[0].shape[1]
    o_w = masks[0].shape[2]
    ratio = o_w / o_h
    mask_h = int(torch.sqrt(torch.tensor(seq_len / ratio)))
    mask_h = int(mask_h) + int((seq_len % int(mask_h)) != 0)
    mask_w = seq_len // mask_h

    if len(mask.shape) == 2:
        mask = mask.unsqueeze(0)
    mask_downsample = F.interpolate(
        torch.tensor(mask, dtype=torch.float32).unsqueeze(0), size=(mask_h, mask_w), mode="bicubic"
    ).squeeze(0)

    if mask_downsample.shape[0] < batch_size:
        mask_downsample = mask_downsample.repeat(batch_size, 1, 1)
    if mask_downsample.shape[0] > batch_size:
        mask_downsample = mask_downsample[:batch_size, :, :]

    mask_downsample = mask_downsample.view(mask_downsample.shape[0], -1, 1).repeat(
        1, 1, current_ip_hidden_states.shape[-1]
    )

    mask_downsample = mask_downsample.to(query.dtype).to(current_ip_hidden_states.device)
Let's move this code to VaeImageProcessor (https://github.com/huggingface/diffusers/blob/main/src/diffusers/image_processor.py). Maybe we can create an IPAdapterMaskProcessor(VaeImageProcessor) and add a downsample method:
Suggested change: replace the block above with

mask_downsample = IPAdapterMaskProcessor.downsample(mask, seq_length, batch_size)
mask_downsample = mask_downsample.to(query.dtype).to(current_ip_hidden_states.device)
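Something along these lines, perhaps. A rough sketch of what that helper could look like, mirroring the downsample logic quoted above (the class name comes from the suggestion; the exact signature, argument order, and defaults are illustrative and may differ from what finally lands):

import math
import torch
import torch.nn.functional as F
from diffusers.image_processor import VaeImageProcessor


class IPAdapterMaskProcessor(VaeImageProcessor):
    @staticmethod
    def downsample(mask: torch.FloatTensor, batch_size: int, num_queries: int, value_embed_dim: int):
        # pick a (mask_h, mask_w) grid that keeps the mask aspect ratio and
        # roughly matches the number of attention queries
        o_h, o_w = mask.shape[1], mask.shape[2]
        ratio = o_w / o_h
        mask_h = int(math.sqrt(num_queries / ratio))
        mask_h = mask_h + int((num_queries % mask_h) != 0)
        mask_w = num_queries // mask_h

        mask_downsample = F.interpolate(mask.unsqueeze(0), size=(mask_h, mask_w), mode="bicubic").squeeze(0)

        # repeat (or trim) so the mask covers the possibly CFG-doubled batch
        if mask_downsample.shape[0] < batch_size:
            mask_downsample = mask_downsample.repeat(batch_size, 1, 1)
        if mask_downsample.shape[0] > batch_size:
            mask_downsample = mask_downsample[:batch_size, :, :]

        # flatten the spatial dims, then pad or crop to the exact query count
        # in case the mask and the output image have different aspect ratios
        mask_downsample = mask_downsample.reshape(mask_downsample.shape[0], -1)
        if mask_h * mask_w < num_queries:
            mask_downsample = F.pad(mask_downsample, (0, num_queries - mask_downsample.shape[1]), value=0.0)
        if mask_h * mask_w > num_queries:
            mask_downsample = mask_downsample[:, :num_queries]

        # broadcast over the value embedding dimension so it can be multiplied
        # with the IP hidden states directly
        return mask_downsample.unsqueeze(-1).repeat(1, 1, value_embed_dim)

The attention processor would then only need the two-line call from the suggestion above.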
So the remaining items are:
- the resizing [WIP] IP-Adapter attention masking #6847 (comment)
- refactor the code
I have just added padding to fix the resizing bug; I see the output is still good.
Maybe it is better to recommend using masks with an aspect ratio equal or very close to that of the output image, while avoiding errors if there is a mismatch.
I will finish refactoring as suggested after work :)
Updated snippet to run inference:
from diffusers import AutoPipelineForText2Image, DDIMScheduler
import torch
from diffusers.utils import load_image
from transformers import CLIPVisionModelWithProjection
from diffusers.image_processor import IPAdapterMaskProcessor
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
"h94/IP-Adapter",
subfolder="models/image_encoder",
torch_dtype=torch.float16,
)
pipeline = AutoPipelineForText2Image.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
image_encoder=image_encoder,
).to("cuda")
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
face_image1 = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_mask_girl1.png")
face_image2 = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_mask_girl2.png")
mask1 = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_mask_mask1.png")
mask2 = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_mask_mask2.png")
processor = IPAdapterMaskProcessor()
masks = processor.preprocess([mask1, mask2])
ip_images = [[face_image1], [face_image2]]
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name=["ip-adapter-plus-face_sdxl_vit-h.safetensors"] * 2)
pipeline.set_ip_adapter_scale([0.7, 0.7])
generator = torch.Generator(device="cpu").manual_seed(1)
num_images=1
images = pipeline(
prompt="2 girls",
ip_adapter_image=ip_images,
negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
num_inference_steps=20, num_images_per_prompt=num_images,
generator=generator, cross_attention_kwargs={"ip_adapter_masks": masks}
).images
@yiyixuxu can you take a look when you have time, please?
I added a test; I didn't touch the documentation because I saw there is a big refactoring going on right now.
Thanks :)
ohh looking great!
left a few nits
Also, I think if we merge this #6915 (comment), we won't need to add the additional ip_adapter_mask argument to the default attention processors.
This test PR is relevant too; we will wait for it to merge and then update the test #6888.
the doc can be added later
cc @asomoza can you do a final review too?
    return images

@staticmethod
def downsample(mask: torch.FloatTensor, batch_size: int, seq_length: int, value_embed_dim: int):
It's uncommon to see seq_length as a parameter for downsampling, at least in the vision domain. Could we aim for a better and more well-understood argument name here?
o_h = mask.shape[1]
o_w = mask.shape[2]
ratio = o_w / o_h
mask_h = int(torch.sqrt(torch.tensor(seq_length / ratio)))
mask_h = int(mask_h) + int((seq_length % int(mask_h)) != 0)
mask_w = seq_length // mask_h
Why do we need to use torch math operations here? We can directly use math.sqrt() without having to convert seq_length / ratio to a Torch tensor, no?
mask_downsample = F.interpolate(mask.unsqueeze(0), size=(mask_h, mask_w), mode="bicubic").squeeze(0)

# Repeat mask until batch_size
(nit): "until" indicates that there could be a while
in the subsequent operations. A better comment could be:
"# Repeat batch_size
times". WDYT?
# If the output image and the mask do not have the same aspect ratio, tensor shapes will not match
# Pad tensor if downsampled_mask.shape[1] is smaller than seq_length
if mask_h * mask_w < seq_length:
Could mask_h * mask_w be assigned to a variable (with a meaningful name) and reused?
attention_mask: Optional[torch.FloatTensor] = None,
temb: Optional[torch.FloatTensor] = None,
scale: float = 1.0,
ip_adapter_masks=None,
Let's add a type annotation here.
if not isinstance(ip_adapter_masks, torch.Tensor) or ip_adapter_masks.ndim != 4:
    raise ValueError(" ip_adapter_mask should be a tensor with shape [num_ip_adapter, 1, height, width]."
                     " Please use `IPAdapterMaskProcessor` to preprocess your mask")
if len(ip_adapter_masks) != len(ip_hidden_states):
    raise ValueError(
        f"Number of ip_adapter_masks ({len(ip_adapter_masks)}) must match number of IP-Adapters ({len(self.scale)})"
    )
I find it a little weird that in the condition we have the length of ip_hidden_states while in the error message we're relying on the length of self.scale. Maybe settle on one of the two here?
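For example, something along these lines would keep the check and the message consistent (just a sketch):

if len(ip_adapter_masks) != len(ip_hidden_states):
    raise ValueError(
        f"Number of ip_adapter_masks ({len(ip_adapter_masks)}) must match "
        f"number of IP-Adapters ({len(ip_hidden_states)})"
    )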
        f"Number of ip_adapter_masks ({len(ip_adapter_masks)}) must match number of IP-Adapters ({len(self.scale)})"
    )
else:
    ip_adapter_masks = [None] * len(ip_hidden_states)
Here as well. Let's make sure to not use multiple variables to check for the right length of components.
ip_attention_probs = attn.get_attention_scores(query, ip_key, None)
current_ip_hidden_states = torch.bmm(ip_attention_probs, ip_value)
current_ip_hidden_states = attn.batch_to_head_dim(current_ip_hidden_states)
current_ip_hidden_states = current_ip_hidden_states.to(query.dtype)
Do we need to do the casting here?
if not isinstance(ip_adapter_masks, torch.Tensor) or ip_adapter_masks.ndim != 4:
    raise ValueError(" ip_adapter_mask should be a tensor with shape [num_ip_adapter, 1, height, width]."
                     " Please use `IPAdapterMaskProcessor` to preprocess your mask")
if len(ip_adapter_masks) != len(ip_hidden_states):
    raise ValueError(
        f"Number of ip_adapter_masks ({len(ip_adapter_masks)}) must match number of IP-Adapters ({len(self.scale)})"
    )
else:
    ip_adapter_masks = [None] * len(ip_hidden_states)
Same as the previous comments regarding length.
max_diff = numpy_cosine_similarity_distance(image_slice, expected_slice)
assert max_diff < 5e-4

def test_masks(self):
Can we also add a test case for checking multiple images with masks, here?
This test case uses two images, two masks, and two IP-Adapters (the same IP-Adapter loaded twice).
I see. Does it then make sense to separate the test cases and have more descriptive naming?
Could we actually break this up into two cases, test_ip_adapter_single_mask and test_ip_adapter_multiple_masks? Just easier to tell if a specific functionality is broken.
@yiyixuxu @fabiorigano can you please do a test generating a 1024x1024 image with these masks?
mask 1 | mask 2 (mask images)
For me it is not working as expected:
If it's not just me, IMO we should just do an interpolation with the target width and height; masks for IP-Adapters don't have to be that precise. We should also specify in the documentation that masks work better when they have the same ratio as the generation.
Still, it would be better to generate an image that matches the areas of the masks than to crop or pad them; for example, my code generates this image with the same masks:
And even if I do a portrait image with landscape masks, it works:
WDYT?
Are you passing both masks as inputs?
Yes, I used the same code and just replaced the masks; with 1:1 masks it works perfect.
with 1:1 masks it works perfect.
What's meant by 1:1 masks?
and also specify in the documentation that mask works better when they're of the same ratio than the generation.
You mean the input masks are interpolated such that they share the same aspect ratio? How should this aspect ratio be calculated, then? From the input images?
Apologies, in advance, for my naive questions.
No problem, English is not my native language, so I think the problem is more on my side.
Since the generation I used was for a width and height of 1024x1024, that's a 1:1 aspect ratio, so that's why I wrote 1:1. In this case it means I used square images as masks (1024x1024 to be precise), and in that case it works perfectly.
The interpolation I'm referring to is something like this:
mask = torch.nn.functional.interpolate(
mask.unsqueeze(1),
size=(generation_height // self.vae_scale_factor, generation_width // self.vae_scale_factor),
mode="bicubic"
).squeeze(1)
this will distort the mask to match the generation aspect ratio but as you can see in my examples, it maintains the "areas" of the mask well.
@yiyixuxu @sayakpaul thanks for your reviews!
@asomoza I use those masks in the example in #6847 (comment)
Try a different seed and let me know (the default output size for SDXL is 1024x1024)
I did try with different seeds and I always get some weird results. Just in case, the masks I provided aren't the same as the old ones; these are 1156x896 images, not square ones. If it works for you, then it's a problem on my side.
Oh, I just saw the shape is different.
Maybe the aspect ratio of the masks is too different from that of the output.
Yeah, I tested it just in case. I don't really see the point of using masks that don't match the aspect ratio of the output; I didn't even bother to test it with input images that aren't square since that's even more pointless. But I still think people will do it and open issues about it.
the rest of the code looks good, nice work
Can we maybe expose these options in the preprocessing methods in a nice way and document them? I imagine people would want to go for the best-practiced options for their use cases. So, what I understand is that depending on the aspect ratio, the interpolation behavior might change (sometimes we do crop/pad, and sometimes we don't).
Or is my understanding entirely wrong?
The best practice is to use masks that have the same aspect ratio as the output image; I may add a UserWarning when they don't match.
Since in the attention processor we have no information about the aspect ratio of the output, I used padding and cropping to match the shape of the downsampled mask with the length of the target sequence. The other solution is to simply raise an exception to force users to change the mask or the shape of the output
also I think if we merge this #6915 (comment) we won't need to add the additional ip_adapter_mask argument to the default attention processors
@yiyixuxu thanks for merging your PR, I just removed unused arguments :)
@yiyixuxu thanks for the Good Example PR label!
What can I do to fix the Fast tests checks? I see many OSError entries in the logs, so maybe they should be run again?
That was because the HF Hub was down. I just rebased your PR with the latest main. Let's see :)
I want to leave a comment on mask preprocessing for future documentation (maybe Sayak was asking here #6847 (comment))
We have several options:
1. masks and output image have the same aspect ratio: preprocessing can be done with IPAdapterMaskProcessor.preprocess as in this example #6847 (comment), without further changes
2. masks and output image don't have the same aspect ratio:
a. (recommended) preprocessing can be done with IPAdapterMaskProcessor.preprocess, but the height and width of the output image must be passed as arguments, like this: processor.preprocess([mask1, mask2], height=output_height, width=output_width). Masks will be stretched to fit the target shape.
b. if the aspect ratios are not very different, preprocessing can be done as in 1. Masks will preserve their original aspect ratio during downsampling, but some extra padding will be added if the downsampled size doesn't match the number of queries in the attention. When the aspect ratios of the masks and the output image are very different, this option is not recommended.
@asomoza for completeness I tested your example in #6847 (comment). I leave here the change to the code and the resulting image:
# masks have both shape: (1152, 896) W,H
output_height = 1024
output_width = 1024
processor = IPAdapterMaskProcessor()
masks = processor.preprocess([mask1, mask2], height=output_height, width=output_width)
# masks have now shape: [2, 1, 1024, 1024] Num_Images, C, H, W
thanks everyone who contributed here!
I think we should go with the simplest reasonable alternative from our code in the default setting and document the rest of the gotchas very clearly so that users can avail themselves of all the options. Goes well with our philosophy of being "simple over easy" as well.
What do y'all think?
that sounds good to me
should we wait for PR #6897 to be merged?
looks great to me!
@sayakpaul feel free to merge this if you're happy about it!
WDYT about adding a section about attention masking in the https://huggingface.co/docs/diffusers/main/en/using-diffusers/ip_adapter doc?
Once that's done and reviewed, let's ship this!
Looks good to me. Just one small request related to testing.
thanks @DN6, I added one more test
@yiyixuxu could you load this #6847 (comment) output image to your HF testing-images repository as "ip_adapter_masking_output.png" please? thank you
I can also load it to the documentation-images repository if it is faster
Could you let me know which images you want to see uploaded on the Hub? I can do that quickly :)
@sayakpaul this one. thank you very much! it is the one obtained with seed = 0 (see docs)
Hey folks, I tried out the branch. It works for me except when I call pipe.unload_ip_adapter() prior to loading the weights.
If I load the weights initially -- success.
If I load other weights, unload, and reload, I get: RuntimeError: mat1 and mat2 shapes cannot be multiplied (514x1664 and 1280x1280)
Not sure if this is in scope here, but I just thought I would mention it before it's merged!
We are going to merge it. I welcome you to open a new issue with a fully reproducible code snippet afterward.
@yiyixuxu feel free to merge if this looks like a go to you.
@dhealy05 this happens because you are using an SDXL pipeline and not reloading the correct image encoder.
When using IP-Adapters for SDXL, you must first load the CLIPVisionModelWithProjection image encoder from the "models/image_encoder" folder of "h94/IP-Adapter".
Calling pipeline.unload_ip_adapter() removes both IP-Adapter weights and image encoder from the pipeline.
This leads to the issue: by default, if you don't load an image encoder into the pipeline, it is searched for in the IP-Adapter folder. In the case of the IP-Adapters for SDXL, this folder is "sdxl_models/image_encoder" and not "models/image_encoder".
To solve the problem, you need to reload the image encoder as follows:
import torch
from diffusers import AutoPipelineForText2Image
from transformers import CLIPVisionModelWithProjection

# define the image_encoder
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
"h94/IP-Adapter",
subfolder="models/image_encoder",
torch_dtype=torch.float16,
)
# define your pipeline
pipeline = AutoPipelineForText2Image.from_pretrained(
base_model_path,
torch_dtype=torch.float16,
image_encoder=image_encoder
)
pipeline.to("cuda")
# load your IP-Adapters for SDXL (first time)
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name=["ip-adapter-plus-face_sdxl_vit-h.safetensors"]*2)
# do your inference
#unload IP-Adapters
pipeline.unload_ip_adapter()
# **reload image encoder in the pipeline (very important)**
pipeline.image_encoder=image_encoder
# load your IP-Adapters for SDXL (second time)
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name=["ip-adapter-plus-face_sdxl_vit-h.safetensors"]*2)
# do your inference
@fabiorigano that was it, thank you !!
Excellent work, @fabiorigano. Also, hat-tip to @asomoza for all the helpful suggestions and testing!
I had a quick look at the code, sorry if a bit late
mask_h = int(torch.sqrt(torch.tensor(seq_len / ratio)))
mask_h = int(mask_h) + int((seq_len % int(mask_h)) != 0)
mask_w = seq_len // mask_h
Not sure why torch.sqrt is used instead of math.sqrt; feels like a waste. Also, rounding mask_h in the first line might, I believe, introduce rounding errors.
torch.tensor(mask, dtype=torch.float32).unsqueeze(0), size=(mask_h, mask_w), mode="bicubic").squeeze(0)
I would switch to "bilinear" that is faster. I don't think the mask would need bicubic anyway.
if mask_downsample.shape[0] < batch_size:
mask_downsample = mask_downsample.repeat(batch_size, 1, 1)
if mask_downsample.shape[0] > batch_size:
mask_downsample = mask_downsample[:batch_size, :, :]
use if...elif
From mask_downsample.repeat(batch_size, 1, 1) I assume you allow only 1 mask? If that is the case, you should trim the tensor before downsampling, otherwise you are wasting resources. If you allow only 1 mask, it's also unlikely that the second statement is ever true.
In ComfyUI I allow sending multiple masks that are applied one per latent in the batch, but there's no such logic here.
if mask_h * mask_w < seq_len:
mask_downsample = F.pad(mask_downsample, (0, seq_len-mask_downsample.shape[1]), value=0.0)
if mask_h * mask_w > seq_len:
mask_downsample = mask_downsample[:, :seq_len]
use if...elif
is elif somewhat discouraged in diffusers?
Hi @cubiq, I think you are looking at an old implementation; here is the merged version: diffusers/src/diffusers/image_processor.py, line 889 in ce9825b.
@fabiorigano oh okay, sorry, some of the remarks still stand:
mask_h = int(math.sqrt(num_queries / ratio))
mask_h = int(mask_h) + int((num_queries % int(mask_h)) != 0)
mask_w = num_queries // mask_h
Don't int() the first mask_h (I think it might introduce more rounding errors), or don't int() it again in the second line.
I would use bilinear instead of bicubic.
Use if/elif.
sorry if it took me so long to reply
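Putting those remarks together, a small sketch of what the rounding part could look like (illustrative only, not a patch against the merged code):

import math

def downsample_grid(num_queries: int, ratio: float) -> tuple:
    # apply int() only once instead of twice
    mask_h = int(math.sqrt(num_queries / ratio))
    mask_h = mask_h + int((num_queries % mask_h) != 0)
    mask_w = num_queries // mask_h
    return mask_h, mask_w

# the other two remarks amount to mode="bilinear" in F.interpolate and
# an if/elif for the batch-size adjustment
print(downsample_grid(4096, 704 / 512))  # (55, 74)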
What does this PR do?
Fixes #6802
Who can review?
@yiyixuxu @asomoza