        return BatchFeature(data={**text_inputs, **image_inputs})

    def _get_number_of_features(self, height: int, width: int) -> int:
Mostly copied from TGI with minor changes in calculations for unpadding, otherwise it won't work for low resolution images
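For readers following along, here is a simplified, illustrative sketch (not the actual TGI / transformers math) of how the number of image features relates to the input resolution for a LLaVA-NeXT style model; the real `_get_number_of_features` additionally handles the unpadding and newline tokens that this comment is about:

```python
import math

# Each crop of size `image_size` contributes (image_size // patch_size) ** 2 patch
# embeddings, on top of the resized "global" image. Unpadding is deliberately ignored.
def approx_number_of_features(height: int, width: int, image_size: int = 336, patch_size: int = 14) -> int:
    patches_per_side = image_size // patch_size            # 24 for CLIP ViT-L/14-336
    base_features = patches_per_side ** 2                  # the resized base image
    grid_h = math.ceil(height / image_size)
    grid_w = math.ceil(width / image_size)
    crop_features = grid_h * grid_w * patches_per_side ** 2
    return base_features + crop_features

print(approx_number_of_features(672, 672))   # 2880 = 576 + 4 * 576
```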
Looking forward to seeing this expanded to other VLMs! Some might be trickier - PaliGemma incorporates causal mask computation in the merge method, for instance (thought about that when reading) - but it makes sense that most of this should belong in the processor, not the modeling
@amyeroberts I did some clean-up after Arthur's comments. Requesting review, should be ready. If this works I will expand the logic to BLIP and PaliGemma in the next weeks
What changed:
Thanks for working on this - will be great to have some of this logic unified!
Main comment is about how we set the required arguments for processing in the processor
@amyeroberts addressed the comments and added all VLMs to the PR (excluding Idefics, Fuyu and Kosmos as those already have expansion in processing).
Processor class (with `processor.patch_size = patch_size`)
Wow - a big piece of work!
Overall looks good to me, just a few comments here and there. I'd like to have a second review from @molbap and a run on the slow tests for all the models touched here
            special_image_mask = (input_ids == self.config.image_token_index).unsqueeze(-1).expand_as(inputs_embeds)
            inputs_embeds[special_image_mask] = language_model_inputs.flatten()
        else:
            logger.warning_once(
We should make sure the official checkpoints have been updated this way
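As a small, self-contained illustration of the masked-scatter trick in the hunk above (toy sizes and a hypothetical token id, not the actual BLIP-2 shapes): positions holding the image token id are overwritten with the flattened image embeddings, one embedding row per image token.

```python
import torch

hidden_size = 4
image_token_index = 32000
input_ids = torch.tensor([[32000, 32000, 5, 6]])
inputs_embeds = torch.zeros(1, 4, hidden_size)
language_model_inputs = torch.ones(2, hidden_size)   # one row per <image> position

special_image_mask = (input_ids == image_token_index).unsqueeze(-1).expand_as(inputs_embeds)
inputs_embeds[special_image_mask] = language_model_inputs.flatten()
print(inputs_embeds[0, :, 0])   # tensor([1., 1., 0., 0.])
```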
Will continue reviewing (biig work) but sharing feedback on the first 2 things I tested, Llava and Paligemma
         inputs_embeds = self.get_input_embeddings()(input_ids)

-        # 2. Merge text and images
-        if pixel_values is not None and input_ids.shape[1] != 1:
-            image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
-            # this is not memory efficient at all (output_hidden_states=True) will save all the hidden stated.
-            selected_image_feature = image_outputs.hidden_states[vision_feature_layer]
+        # if the number of image tokens is more than image embeddings seq length, then prob we expanded it in processing
+        # not very reliable, but we don't expect one to actually pass 500+ images for one prompt
couldn't we also have a very tiny image embeddings sequence length?
For LLaVa models the usual backbone is CLIP (I don't remember which ckpt exactly), which always gives a 576-token sequence. In case some LLaVas have another vision backbone, I guess the user has to switch the `image_seq_length` parameter in the model's config.
I'm wondering if we can catch that and give the user an informative error msg. Will explore this issue more.
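As a quick sanity check of the 576 figure mentioned above: LLaVA-1.5's CLIP ViT-L/14-336 backbone produces (336 / 14) ** 2 patch embeddings per image, which is where the default image sequence length comes from.

```python
# 336x336 input, 14x14 patches -> 24x24 grid of patch embeddings
image_size, patch_size = 336, 14
print((image_size // patch_size) ** 2)   # 576
```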
It's a biiig piece of work, nicely done, tests and all! I left a few comments on some things I didn't understand well + paligemma masking in particular
            logger.warning_once(
                "Expanding inputs for image tokens in BLIP-2 should be done in processing. "
                "Please follow instruction here (https://gist.github.com/zucchini-nlp/e9f20b054fa322f84ac9311d9ab67042) to update your BLIP-2 model. "
                "Using processors without these attributes in the config is deprecated and will throw an error in v4.44."
Maybe this, or another version number as we're in 4.44 dev version
"Using processors without these attributes in the config is deprecated and will throw an error in v4.44." | |
"Using processors without these attributes in the config is deprecated and will throw an error in a later version" |
Yes, will update accordingly when we get one step away from merging. I think two-three major versions from current one will work :)
421 | "with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. " | ||
422 | "Using processors without these attributes in the config is deprecated and will throw an error in v4.44." | ||
423 | ) | ||
424 | if input_ids.shape[1] != 1: | ||
425 | if image_features is not None: | ||
426 | inputs_embeds = inputs_embeds.to(image_features.dtype) | ||
427 | ( | ||
428 | inputs_embeds, | ||
429 | attention_mask, | ||
430 | position_ids, | ||
431 | labels, | ||
432 | input_ids, | ||
433 | ) = self._merge_input_ids_with_image_features( | ||
434 | image_features, | ||
435 | feature_lens, | ||
436 | inputs_embeds, | ||
437 | input_ids, | ||
438 | attention_mask, | ||
439 | position_ids, | ||
440 | labels=labels, | ||
441 | image_token_index=self.config.image_token_index, | ||
442 | ) | ||
443 | if video_features is not None: | ||
444 | ( | ||
445 | inputs_embeds, | ||
446 | attention_mask, | ||
447 | position_ids, | ||
448 | labels, | ||
449 | input_ids, | ||
450 | ) = self._merge_input_ids_with_image_features( | ||
451 | video_features, | ||
452 | video_feature_lens, | ||
453 | inputs_embeds, | ||
454 | input_ids, | ||
455 | attention_mask, | ||
456 | position_ids, | ||
457 | labels=labels, | ||
458 | image_token_index=self.config.video_token_index, | ||
459 | ) |
some repeated code here - wonder if we can refactor a bit to have only one call to `_merge` visible, maybe in a comprehension
Hmm let me see. Actually I wanted to go with one merge call for video-image models but iirc something was off with that method. I guess it was way too much logic to handle edge cases
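For reference, a rough sketch of the suggested refactor, using the variable names from the hunk above; this is a fragment meant to sit inside the forward method, not the actual llava-next-video code, and it glosses over the edge cases mentioned:

```python
# Iterate over the two modalities so _merge_input_ids_with_image_features is
# called in only one place.
for features, lens, token_index in (
    (image_features, feature_lens, self.config.image_token_index),
    (video_features, video_feature_lens, self.config.video_token_index),
):
    if features is None:
        continue
    inputs_embeds = inputs_embeds.to(features.dtype)
    (
        inputs_embeds,
        attention_mask,
        position_ids,
        labels,
        input_ids,
    ) = self._merge_input_ids_with_image_features(
        features,
        lens,
        inputs_embeds,
        input_ids,
        attention_mask,
        position_ids,
        labels=labels,
        image_token_index=token_index,
    )
```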
same comment on _merge method as in modeling_llava_next!
+            target_length = cache_position[-1] + 1
+            causal_mask = torch.full((sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device)

-            if token_type_ids is not None and labels is not None:
-                # we are training thus we need to create a full mask on the image + prefix but causal on suffix
-                target_length = cache_position[-1] + 1
-                causal_mask = torch.full(
-                    (sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device
-                )
-            if sequence_length != 1:
+            # do causal diagonal mask only if training, otherwise attend to the whole prefix
Just verified again, after rebasing on the latest `main`, that the causal masks for the previous and current versions are identical. So iiuc, the mask is full at inference time, which is how it works currently; the only thing that is masked is the padding tokens.
In training, the image you attached is exactly how it works, with the only exception that the current token can attend to itself (as implemented on the `main` branch). The image for some reason says we shouldn't attend.
Will push the changes soon after rebasing on `main`
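To make the masking scheme being discussed concrete, here is an illustrative sketch (not PaliGemma's actual implementation): bidirectional attention over the image + prefix tokens, causal attention over the suffix, with 0 meaning "attend" and a large negative value meaning "masked".

```python
import torch

def prefix_lm_mask(seq_len: int, prefix_len: int, dtype=torch.float32) -> torch.Tensor:
    min_val = torch.finfo(dtype).min
    mask = torch.zeros(seq_len, seq_len, dtype=dtype)
    is_future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    mask = mask.masked_fill(is_future, min_val)   # standard causal mask (self + past visible)
    mask[:, :prefix_len] = 0.0                    # every token sees the full image + prefix
    return mask

print(prefix_lm_mask(5, 3))
```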
This should be done, addressed the comments. For the failing test, I have no idea how to skip it after deprecating a property from config.
Alright cool, taking a look soon! For the config option, a quick & dirty solution could be to do something like `_ = config.ignore_index` in the modeling?
LGTM! Some minor comments remaining but seems good
                    padding_mask, min_dtype
                )

            final_labels = torch.full(
                (batch_size, sequence_length), self.config.ignore_index, dtype=input_ids.dtype, device=input_ids.device
            )
            final_labels = torch.where(input_ids != self.pad_token_id, labels, final_labels)
If the labels are not defined in the same way - i.e. not nulled where padding tokens are - it'll break BC for existing FT scripts, right?
Hmm, do we not expect users to ignore padding while preparing the labels? We can bring this back for BC but afaik the general rule is that LLMs don't mask out pad tokens in labels
Brought back the masking, and added a warning that users should mask labels themselves
Thanks! During training padding for uneven batches is definitely masked in labels, iiuc
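The convention referred to here, as a tiny self-contained example: mask padding positions in the labels yourself so they are ignored by the loss (-100 is the default ignore index of `CrossEntropyLoss`).

```python
import torch

input_ids = torch.tensor([[5, 17, 42, 0, 0]])        # toy batch, 0 = pad token id
attention_mask = torch.tensor([[1, 1, 1, 0, 0]])

labels = input_ids.clone()
labels[attention_mask == 0] = -100
print(labels)   # tensor([[   5,   17,   42, -100, -100]])
```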
    # overwrite inputs_embeds tests because we need to delete "pixel values" for LVLMs
    # while some other models require pixel_values to be present
    def test_inputs_embeds_matches_input_ids(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        for model_class in self.all_model_classes:
            model = model_class(config)
            model.to(torch_device)
            model.eval()

            inputs = self._prepare_for_class(inputs_dict, model_class)
            input_ids = inputs["input_ids"]
            del inputs["input_ids"]
            del inputs["pixel_values"]

            inputs_embeds = model.get_input_embeddings()(input_ids)

            with torch.no_grad():
                out_ids = model(input_ids=input_ids, **inputs)[0]
                out_embeds = model(inputs_embeds=inputs_embeds, **inputs)[0]
            self.assertTrue(torch.allclose(out_embeds, out_ids))
Interesting, same remark - would be worth having this in common, or an option that checks the needed inputs for a given model and does this `del` on demand? (nit)
yeah, I tried, but it seems like some models require both ids and pixels, while others require only one. Will have to think about unifying Vision2Seq model tests somehow, in the scope of another PR
Shouldn't be a big problem to repeat the code in tests
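A hedged sketch of the reviewer's idea (the attribute name `_embeds_test_ignore_keys` is hypothetical, not an existing transformers attribute): a common helper drops only the inputs a given test class declares as unneeded, instead of each model overriding the whole test.

```python
class InputsEmbedsTestMixin:
    # Models that need pixel_values alongside inputs_embeds can override this tuple.
    _embeds_test_ignore_keys = ("pixel_values",)

    def _strip_for_embeds_test(self, inputs: dict) -> dict:
        inputs = dict(inputs)
        inputs.pop("input_ids", None)
        for key in self._embeds_test_ignore_keys:
            inputs.pop(key, None)
        return inputs
```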
            del inputs["input_ids"]
            del inputs["pixel_values"]

            wte = model.get_input_embeddings()
same remark for wte and the transformers version warning, to modify before merge!
Looks great - thanks for handling all of this!
        qformer_config=None,
        text_config=None,
        num_query_tokens=32,
        image_token_index=None,
index or token_id? Index would indicate a specific location, but the logic in 1776 looks like it's matching token_ids
It's the token index, same as in all llava models
I'll run slow tests and check everything is okay, will merge sometime next week
I encountered the warning: "Expanding inputs for image tokens in LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and `processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47."
Could you please guide me on how to choose the `patch_size` and `vision_feature_select_strategy` based on the model? Are there any related documents available? @zucchini-nlp
@pspdada hey! The `patch_size` and `vision_feature_select_strategy` should be taken from the model config, i.e. `vision_feature_select_strategy = model.config.vision_feature_select_strategy` and `patch_size = model.config.vision_config.patch_size`.
The message mentions deprecation in v4.47, but we encountered a few things that still need to be done, so the target version will be moved a few more versions up. After the issues we encountered are fixed, I'll update the official checkpoints on the Hub. Until then you are welcome to use the old way (and ignore the warning), or try out the new logic by setting the necessary params in the processor :)
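A minimal sketch of this advice, using the llava-1.5-7b checkpoint purely as an example: read the two values from the model config and set them on the processor so the processor can expand the image tokens itself.

```python
from transformers import LlavaForConditionalGeneration, LlavaProcessor

model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
processor = LlavaProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Take both values from the model config, as suggested above.
processor.patch_size = model.config.vision_config.patch_size
processor.vision_feature_select_strategy = model.config.vision_feature_select_strategy
```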
vision_feature_select_strategy = model.config.vision_feature_select_strategy and patch_size = model.config.vision_config.patch_size
Thank you for your response, but there is an issue.
I use transformers 4.46.0.dev0, installed via `pip install --upgrade git+https://github.com/huggingface/transformers.git`, which means this pull request has already taken effect (it has been merged into the main branch).
When using the following code to load the llava 1.5 model and generate with it:
def _create_v_1_5(self) -> tuple[LlavaForConditionalGeneration, LlavaProcessor]:
    model_name = f"llava-hf/llava-1.5-{self.model_size}-hf"
    model: LlavaForConditionalGeneration = LlavaForConditionalGeneration.from_pretrained(
        model_name,
        cache_dir=self.model_dir,
        torch_dtype=self.torch_dtype,
        device_map="auto",
        low_cpu_mem_usage=True,
        attn_implementation=attn_implementation,
    ).to(self.device).eval()
    processor: LlavaProcessor = LlavaProcessor.from_pretrained(
        model_name,
        cache_dir=self.model_dir,
        padding_side="left",
        vision_feature_select_strategy=model.config.vision_feature_select_strategy,
        patch_size=model.config.vision_config.patch_size,
    )
    print(model.config.vision_feature_select_strategy, model.config.vision_config.patch_size)
    return model, processor

def _gen(
    self,
    images: list[Image.Image],
    prompts: list[str],
    max_token: int,
    do_sample: list,
    temp: float,
    eos_token_id: list[int] | None = None,
) -> list:
    with torch.inference_mode():
        inputs = self.processor(
            images=images,
            text=prompts,
            return_tensors="pt",
            return_token_type_ids=False,
            padding=True,
        ).to(self.device, torch.float16)
        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=max_token,
            temperature=temp if do_sample else None,
            do_sample=do_sample,
            use_cache=True,
            eos_token_id=eos_token_id,
        )
        decoded_outputs: list[str] = self.processor.batch_decode(
            generated_ids,
            skip_special_tokens=False,
            clean_up_tokenization_spaces=False,
        )
        return decoded_outputs
The output is `default 14`, and an error occurs:
File "/root/llm-project/LVLM/model/llava.py", line 202, in _gen
generated_ids = self.model.generate(
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/transformers/generation/utils.py", line 2220, in generate
result = self._sample(
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/transformers/generation/utils.py", line 3211, in _sample
outputs = self(**model_inputs, return_dict=True)
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/transformers/models/llava/modeling_llava.py", line 524, in forward
raise ValueError(
ValueError: Image features and image tokens do not match: tokens: 325440, features 576
If I remove those two lines, everything works fine:
vision_feature_select_strategy=model.config.vision_feature_select_strategy,
patch_size=model.config.vision_config.patch_size,
Is this phenomenon under control, or should I open a new issue about it and provide full information?
I use the following code to decode the output from `model.generate`:
new_sentence = decoded_output.strip('<pad>').strip('<s>').strip()[len(prompt) + 1:].strip('</s>').strip()
An interesting phenomenon is that the original output was:
The image features a man and a woman standing close to each other, posing for a picture.
However, after adding these two lines:
vision_feature_select_strategy=model.config.vision_feature_select_strategy,
patch_size=model.config.vision_config.patch_size,
the output becomes particularly strange:
...ge><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image>
Please describe this image in detail and tell me what you see. ASSISTANT: The image features a man and a woman standing close to each other, posing for a picture.
@pspdada I just opened a PR for that (#34332). It is a totally unrelated issue and should be solved soon
I've discovered a new issue after the merge of #34332. In the latest version, transformers==4.46.2, problems still occur when I set:
vision_feature_select_strategy=model.config.vision_feature_select_strategy,
patch_size=model.config.vision_config.patch_size,
After setting these parameters, the result of batch inference (without truncation, directly using the `batch_decode` output) is as follows:
['<s> USER: <image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><
image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image> \nPlease describe this image in detail and tell me what you see. ASSISTANT: The image features a man and a woman standing close to each other, posing for a picture. The man is wearing a tie, and the woman is wearing a white shirt. They are both smiling and enjoying the moment.\n\nIn the background, there is a TV mounted on the wall, and a few bottles can be seen placed around the room. There are also two other people in the scene, one on the left side and another on the right side of the image.</s>',
'<pad><pad><pad><pad><pad><pad><pad><s> USER: <image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><i
mage><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image> \nWhat is in the image? ASSISTANT: Theo, the image\n\nWhat\n\n</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>']
However, when I remove these two lines, everything works as expected, and the result is:
['<s> USER: <image> \nPlease describe this image in detail and tell me what you see. ASSISTANT: The image features a man and a woman standing close to each other, posing for a picture. The man is wearing a tie, and the woman is wearing a white shirt. They are both smiling and enjoying the moment.\n\nIn the background, there is a TV mounted on the wall, and a few bottles can be seen placed around the room. There are also two other people in the scene, one on the left side and another on the right side of the image.</s>',
'<pad><pad><pad><pad><pad><pad><pad><s> USER: <image> \nWhat is in the image? ASSISTANT: The image features a group of birds perched on a tree branch.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>']
Shall I open a new issue about it and provide full information? @zucchini-nlp
@pspdada that is the expected logic for the new processing code, as we expand the text with as many "image" tokens as there will be image embeddings. When decoding the text you should pass `skip_special_tokens=True` to remove all `image` tokens
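Concretely, reusing `processor` and `generated_ids` from the snippet earlier in the thread, the decode step would look like this, so the repeated `<image>` placeholders (and `<pad>`/`<s>`) are dropped from the returned strings:

```python
# Special tokens, including the expanded <image> placeholders, are removed here.
decoded_outputs = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
```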
@pspdada that is the expected logic for the new processing code, as we expand the text with as many "image" tokens as there will be image embeddings. When decoding the text you should pass `skip_special_tokens=True` to remove all `image` tokens
@zucchini-nlp But the output of the model doesn't seem quite right this way. Look at my previous example; if we remove the `<image>` and `<pad>` tokens, what remains is:
\nWhat is in the image? ASSISTANT: Theo, the image\n\nWhat\n\n</s>
If I don't set the two lines, it will be:
\nWhat is in the image? ASSISTANT: The image features a group of birds perched on a tree branch.</s>
What does this PR do?
Fixes #30809. This PR moves `_merge_inputs_with_vision_embeds` into the processing logic, thus making VLMs more versatile in terms of generation strategies. All models were tested locally with different batch sizes and image resolutions; the generation is the same as it was before the changes.
The main idea is to get the sequence length of the image features inside the processing files and expand the input ids by repeating the special image token. The same is already done for IDEFICS in `transformers`.
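A minimal sketch of that idea (not the exact transformers implementation): the processor replaces each "<image>" placeholder with as many image tokens as the vision tower will return embeddings, so the modeling code only has to scatter the image features into the already-expanded positions.

```python
def expand_image_tokens(text: str, num_image_tokens: int = 576, image_token: str = "<image>") -> str:
    # 576 corresponds to the CLIP ViT-L/14-336 backbone used by llava-1.5 checkpoints
    return text.replace(image_token, image_token * num_image_tokens)

prompt = "USER: <image>\nWhat is in the image? ASSISTANT:"
print(expand_image_tokens(prompt).count("<image>"))   # 576
```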