transformers
Expand inputs in processors for VLMs
#30962
Merged

zucchini-nlp
zucchini-nlp357 days ago (edited 300 days ago)🚀 2

What does this PR do?

Fixes #30809. This PR moves `_merge_inputs_with_vision_embeds` into the processing logic, thus making VLMs more versatile in terms of generation strategies. All models were tested locally with different batch sizes and image resolutions; generation is the same as it was before these changes.

The main idea is to get the sequence length of the image features inside the processing files and expand the input ids by repeating the special image token. The same is already done for IDEFICS in transformers.
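For illustration, a minimal sketch of that idea (the function name and the 576 figure are illustrative, assuming the usual CLIP ViT-L/14-336 backbone; this is not the exact implementation):

# Expand each "<image>" placeholder to as many image tokens as the vision tower
# will produce embeddings, so the LM sees one placeholder token per image embedding.
def expand_image_tokens(text: str, image_token: str, num_image_tokens: int) -> str:
    return text.replace(image_token, image_token * num_image_tokens)

prompt = "USER: <image>\nWhat is shown here? ASSISTANT:"
expanded = expand_image_tokens(prompt, "<image>", 576)  # 576 patches for CLIP ViT-L/14 at 336px
# `expanded` is then tokenized as usual, and the model only has to scatter image embeddings
# into the positions of the image tokens instead of re-merging whole sequences.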

zucchini-nlp let it be
050657f7
zucchini-nlp draft
a67087e4
zucchini-nlp should not have changed
1e2b873a
zucchini-nlp
zucchini-nlp commented on 2024-05-22
src/transformers/models/llava_next/processing_llava_next.py
        return BatchFeature(data={**text_inputs, **image_inputs})

    def _get_number_of_features(self, height: int, width: int) -> int:
zucchini-nlp357 days ago

Mostly copied from TGI, with minor changes in the calculations for unpadding; otherwise it won't work for low-resolution images.

HuggingFaceDocBuilderDev
HuggingFaceDocBuilderDev357 days ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

molbap
molbap355 days ago❤ 1

Looking forward to seeing this expanded to other VLMs! Some might be trickier; PaliGemma, for instance, incorporates the causal mask computation in the merge method (thought about that while reading), but it makes sense that most of this should live in the processor, not the modeling code.

zucchini-nlp add warnings
70145d46
zucchini-nlp Merge remote-tracking branch 'upstream/main' into vlm_processors
16a67875
zucchini-nlp fix & add tests
8472035b
zucchini-nlp fix tests
13af9e84
zucchini-nlp
zucchini-nlp350 days ago

@amyeroberts I did some clean-up after Arthur's comments. Requesting review, should be ready. If this works I will expand the logic to BLIP and PaliGemma in the next weeks

What changed:

  • The model can generate from both old inputs and new, expanded inputs. For old inputs, a warning is raised asking the user to upgrade the processor config.
  • The processor can also return both types. If it has all the necessary pieces to calculate the image embedding length, the inputs are expanded. Otherwise, a warning is raised and the old behavior is retained.
  • The old behavior is planned to be removed entirely in v4.44 (or better, v4.43?).
  • Added tests to check that generation from old vs. new inputs is identical.
  • To actually have llava-based models work in the new style, I'll later update all hf-llava configs on the Hub. Other models on the Hub will continue to work with the old behavior.
zucchini-nlp requested a review from amyeroberts 350 days ago
zucchini-nlp ipnuts embeds cannot be passed with pixels
41d086f1
zucchini-nlp marked this pull request as ready for review 350 days ago
amyeroberts
amyeroberts commented on 2024-06-03
amyeroberts345 days ago

Thanks for working on this - will be great to have some of this logic unified!

Main comment is about how we set the required arguments for processing in the processor

src/transformers/models/llava/processing_llava.py
    def __init__(self, image_processor=None, tokenizer=None):
        super().__init__(image_processor, tokenizer)
        self.image_size = self.image_processor.size["shortest_edge"]
amyeroberts345 days ago

As llava can take any image processor, we can't assume size will have shortest_edge; it can also be a dict of {"height": h, "width": w}

zucchini-nlp338 days ago

Since we can't be sure which image processor is used and what its output size is, I decided to check the pixel_values size and use that as self.image_size.

Also, we now no longer assume the output is always square and calculate the patches for height and width separately, similar to what ViT does in its modeling file.
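A sketch of that calculation with dummy values (assumes channels-first pixel values; not the exact implementation):

import torch

pixel_values = torch.rand(1, 3, 336, 336)  # processed batch, channels-first (assumption)
patch_size = 14
vision_feature_select_strategy = "default"

height, width = pixel_values.shape[-2:]
num_image_tokens = (height // patch_size) * (width // patch_size) + 1  # +1 for the CLS embedding
if vision_feature_select_strategy == "default":
    num_image_tokens -= 1  # the "default" strategy drops the CLS embedding
# -> 576 for a 336x336 input with 14x14 patches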

src/transformers/models/llava/processing_llava.py
        super().__init__(image_processor, tokenizer)
        self.image_size = self.image_processor.size["shortest_edge"]
        self.patch_size = getattr(self.image_processor, "patch_size", None)
        self.image_token = "<image>"
amyeroberts345 days ago

Is this consistent for all the tokenizers?

zucchini-nlp344 days ago

For tokenizers, yes! The only part that comes from the tokenizer is the image token, which I believe is established as "<image>" for LLaVAs.

For all image processors/vision towers, maybe not. The calculation of the image token sequence length assumes we get CLIP as the image tower. I am not sure whether other vision backbones are all similar or require different handling.

In case there are LLaVAs without CLIP that have to be handled differently, I will see if we can unify it without extra code hacks.

zucchini-nlp338 days ago

For the existing possible backbones (CLIP, SigLIP, Dino, ViT) the current code works as expected. Same for tokenizers; in case any model decides to change the image token, the user can set it with processor.image_token = "{ANY_NEW_TOKEN}".

src/transformers/models/llava/processing_llava.py
                prompt_strings.append(sample)
        else:
            prompt_strings = text
            logger.warning_once(
                "Expanding inputs for image tokens in LLaVa should be done in processing. "
                "Please add `patch_size` and `vision_feature_select_strategy` to the model's image processing config. "
                "Using processors without these attributes in the config is deprecated and will throw an error in v4.44."
            )
amyeroberts345 days ago

I think this is both going to be annoying to handle and creates leakiness in our abstractions: not everyone who uses the image processor will be using it for llava, and there's no reason for image processors for models which don't have patch_size to have this attribute.

Instead, I'd suggest getting the defaults from the image processor if available, but also allowing them as processor arguments in the init and possibly the forward pass which the user can directly set and control. This way, when we save out the processor, these can be saved alongside, and the user doesn't need to change an image processor or override the processor each time they want to use it

zucchini-nlp344 days ago

Right, that will be easier for users. Will make changes
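A sketch of that direction (illustrative only; the final signature in the PR may differ):

from transformers.processing_utils import ProcessorMixin

class LlavaProcessor(ProcessorMixin):
    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "AutoImageProcessor"
    tokenizer_class = "AutoTokenizer"

    def __init__(self, image_processor=None, tokenizer=None, patch_size=None, vision_feature_select_strategy=None):
        # explicit arguments win; otherwise fall back to whatever the image processor happens to carry
        self.patch_size = patch_size if patch_size is not None else getattr(image_processor, "patch_size", None)
        self.vision_feature_select_strategy = vision_feature_select_strategy
        super().__init__(image_processor, tokenizer)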

tests/models/llava/test_modeling_llava.py
        inputs["inputs_embeds"] = wte(input_ids)

        with torch.no_grad():
            model(**inputs)[0]
amyeroberts345 days ago

Why index on the 0'th output here ?

zucchini-nlp344 days ago👍 1

no reason, it was simply copied as is from test_modeling_common.py with extra handling of pixel_values as input to forward.

I'll remove it here, as it's not making any difference

zucchini-nlp more updates
bf59ed69
zucchini-nlp paligemma ready!
020e7ed1
zucchini-nlp minor typos
3e0455c8
zucchini-nlp update blip-2
674f16e6
zucchini-nlp fix tests & raise error
42ae6464
zucchini-nlp Merge branch 'main' into vlm_processors
b5259f2d
zucchini-nlp docstring
a6c50deb
zucchini-nlp add blip2 test
4766e2ea
zucchini-nlp Merge branch 'main' into vlm_processors
d46df906
zucchini-nlp
zucchini-nlp338 days ago (edited 337 days ago)

@amyeroberts addressed the comments and added all VLMs to the PR (excluding Idefics, Fuyu and Kosmos, as those already have expansion in processing).

  • The warning text is clearer, and it's easy for users to add the new attributes to the Processor class (with processor.patch_size = patch_size).
  • BLIP-2 needed more modifications as it didn't have a special image token; lmk if the way I did it works.
  • PaliGemma worked out of the box but needed changes for the causal mask. There's also something weird with "position_ids" which will be fixed by @molbap.
  • All models have their "old vs. new format equivalence" tests and they pass locally. I don't know how to make the failing doctest happy; it's red even after I deprecated the unused attribute.
zucchini-nlp requested a review from amyeroberts 338 days ago
zucchini-nlp tmp
f74297bb
amyeroberts added run-slow
amyeroberts
amyeroberts commented on 2024-06-13
amyeroberts335 days ago👀 1

Wow - a big piece of work!

Overall looks good to me, just a few comments here and there. I'd like to have a second review from @molbap and a run on the slow tests for all the models touched here

tests/models/vipllava/test_modeling_vipllava.py
        output_expanded = model.generate(**inputs_expanded, min_new_tokens=20, max_new_tokens=20)

        # check that both inputs are handled correctly and generate the same output
        self.assertListEqual(output_expanded[:, -20:].tolist(), output[:, -20:].tolist())
amyeroberts335 days ago

❤️

src/transformers/models/vipllava/modeling_vipllava.py
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        cache_position: Optional[torch.LongTensor] = None,
amyeroberts335 days ago👍 1

Should be added to the docstring

src/transformers/models/vipllava/modeling_vipllava.py
423 "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
424 )
425
426
if pixel_values is not None and inputs_embeds is not None:
427
raise ValueError(
428
"You cannot specify both pixel_values and inputs_embeds at the same time, and must specify either one"
429
)
amyeroberts335 days ago

Logic check here also isn't consistent with the warning - it doesn't catch if both are None

zucchini-nlp330 days ago👍 1

Both can be None; that happens when we are in the decoding stage and get only the last token for generation. We just can't have both set to a tensor, because then we don't know where the image tokens are, so we can't merge the images into the embeddings.

src/transformers/models/video_llava/processing_video_llava.py
        image_token="<image>",  # set the default and let users change if they have peculiar special tokens in rare cases
        video_token="<video>",
    ):
        self._patch_size = patch_size
        self._vision_feature_select_strategy = vision_feature_select_strategy
amyeroberts335 days ago

Why have these set as private?

zucchini-nlp330 days ago

oh, forgot to get it back, was part of an older logic

src/transformers/models/vipllava/modeling_vipllava.py
        # if the number of image tokens is more than image embeddings seq length, then prob we expanded it in processing
        # not very reliable, but we don't expect one to actually pass 500+ images for one prompt
        # In case we're in decoding stage, legacy behavior is checked by presence of pixel values even if use_cache=True
        legacy_processing = ((input_ids == self.config.image_token_index).sum(1).max() < 576) or (
amyeroberts331 days ago👍 1

Is the 576 the sequence length? We should take this from the config, rather than hard-code here

zucchini-nlp330 days ago

Yes, it's the length for one image, and it can be added as image_seq_length to all configs. I will start updating the llava configs (and maybe others by opening a PR) today so the default behavior becomes "expand in the processor".
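A hedged sketch of what that check looks like once image_seq_length lives in the config (not necessarily the exact merged code):

import torch

def is_legacy_input(input_ids: torch.Tensor, pixel_values, image_token_index: int, image_seq_length: int) -> bool:
    # Fewer image tokens than one image's embedding length means the processor did not
    # expand the prompt, so the model should fall back to the legacy merging path.
    # During cached decoding (a single new token), the presence of pixel_values also signals legacy inputs.
    not_expanded = (input_ids == image_token_index).sum(1).max() < image_seq_length
    return bool(not_expanded) or (input_ids.shape[-1] == 1 and pixel_values is not None)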

src/transformers/models/blip_2/modeling_blip_2.py
            special_image_mask = (input_ids == self.config.image_token_index).unsqueeze(-1).expand_as(inputs_embeds)
            inputs_embeds[special_image_mask] = language_model_inputs.flatten()
        else:
            logger.warning_once(
amyeroberts331 days ago👍 1

We should make sure the official checkpoints have been updated this way

zucchini-nlp add image seq length to config
5fc8565c
zucchini-nlp update docstring
1b4674ad
zucchini-nlp Merge branch 'main' into vlm_processors
c3c130bc
zucchini-nlp delete
84388756
zucchini-nlp fix tests
bf9e637b
zucchini-nlp fix blip
db1fa4f3
molbap
molbap commented on 2024-06-20
molbap328 days ago

Will continue reviewing (biig work) but sharing feedback on the 2 first things I tested, Llava and Paligemma

src/transformers/models/paligemma/modeling_paligemma.py
molbap328 days ago

General comment:

  • nice simplification!
  • currently generation seems to be broken on PaliGemma. When doing
from PIL import Image
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

model = PaliGemmaForConditionalGeneration.from_pretrained('google/paligemma-3b-ft-docvqa-896')
processor = PaliGemmaProcessor.from_pretrained('google/paligemma-3b-ft-docvqa-896')

img = Image.open("/home/pablo/docvqa_example.png").convert("RGB")
text = "answer en what is the document type"

inputs = processor(images=[img], text=text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(outputs))

model.generate(...) works on main, but fails in modeling_gemma when setting the max_past_length attribute which requires the check

if past_key_values.get_max_length() is not None

and it seems that past_key_value is a tuple, raising AttributeError: 'tuple' object has no attribute 'get_max_length'. This happens after the first token is generated, ie when past_key_values is not None

zucchini-nlp328 days ago👍 1

Hmm, weird, since we should be converting to the Cache class inside generate before going to the modeling part. I will look into this tomorrow

zucchini-nlp327 days ago

Found the bug, PaliGemma didn't have the cache flags on. Resolved!

src/transformers/models/llava/modeling_llava.py
molbap328 days ago (edited 327 days ago)

Same comment as for modeling_paligemma, wondering if it's an issue on my end? after the first token generated, modeling_llama will throw because of

AttributeError: 'tuple' object has no attribute 'get_seq_length'

(this does not happen on main)

src/transformers/models/llava/modeling_llava.py
        inputs_embeds = self.get_input_embeddings()(input_ids)

        # before this PR:
        # 2. Merge text and images
        if pixel_values is not None and input_ids.shape[1] != 1:
            image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
            # this is not memory efficient at all (output_hidden_states=True) will save all the hidden stated.
            selected_image_feature = image_outputs.hidden_states[vision_feature_layer]
        # after this PR:
        # if the number of image tokens is more than image embeddings seq length, then prob we expanded it in processing
        # not very reliable, but we don't expect one to actually pass 500+ images for one prompt
molbap328 days ago

couldn't we also have a very tiny image embeddings sequence length?

zucchini-nlp328 days ago👍 1

For LLaVa models the usual backbone is CLIP (I don't remember which checkpoint exactly), which always gives a sequence length of 576. In case some LLaVas have another vision backbone, I guess the user has to switch the image_seq_length parameter in the model's config.

I am thinking about whether we can catch that and give the user an informative error message. Will explore this issue more
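For reference, where the 576 comes from for the usual CLIP ViT-L/14 backbone at 336x336 (quick arithmetic check):

image_size, patch_size = 336, 14
image_seq_length = (image_size // patch_size) ** 2  # 24 * 24 = 576 image embeddings per image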

zucchini-nlp fix paligemma
246b06a1
zucchini-nlp added WIP
zucchini-nlp merge `main`
222bf9a6
zucchini-nlp out-of-place scatter
54862156
zucchini-nlp add llava-next-video
78c44844
molbap requested a review from molbap 287 days ago
molbap
molbap commented on 2024-08-05
molbap282 days ago

It's a biiig piece of work, nicely done, tests and all! I left a few comments on some things I didn't understand well + paligemma masking in particular

src/transformers/models/blip_2/modeling_blip_2.py
            logger.warning_once(
                "Expanding inputs for image tokens in BLIP-2 should be done in processing. "
                "Please follow instruction here (https://gist.github.com/zucchini-nlp/e9f20b054fa322f84ac9311d9ab67042) to update your BLIP-2 model. "
                "Using processors without these attributes in the config is deprecated and will throw an error in v4.44."
molbap282 days ago

Maybe this, or another version number as we're in 4.44 dev version

Suggested change
"Using processors without these attributes in the config is deprecated and will throw an error in v4.44."
"Using processors without these attributes in the config is deprecated and will throw an error in a later version"
zucchini-nlp282 days ago👍 1

Yes, will update accordingly when we get one step away from merging. I think two-three major versions from current one will work :)

src/transformers/models/blip_2/modeling_blip_2.py
        attention_mask = torch.ones_like(input_ids)
        # before this PR:
        attention_mask = torch.cat([language_attention_mask, attention_mask.to(language_attention_mask.device)], dim=1)

        # concatenate query embeddings with prompt embeddings
        inputs_embeds = self.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([language_model_inputs, inputs_embeds.to(language_model_inputs.device)], dim=1)
        # after this PR:
        # if the model already has "image_token_index" then the input is expanded to account for image embeds
        # otherwise we expand manually by concating
molbap282 days ago

typo to change in a few places

Suggested change
# otherwise we expand manually by concating
# otherwise we expand manually by concatenating
src/transformers/models/blip_2/processing_blip_2.py
            image_tokens = self.image_token.content * self.num_query_tokens
            image_token_encoding = self.tokenizer([image_tokens], add_special_tokens=False, return_tensors=None)
            for k in _text_encoding:
                text_encoding[k] = list(map(concat, image_token_encoding[k], _text_encoding[k]))
molbap282 days ago

Is this equivalent to

text_encoding[k] = [img_encoding + txt_encoding for img_encoding, txt_encoding in zip(image_token_encoding[k], _text_encoding[k])]

? slightly more readable

zucchini-nlp282 days ago👍 1

Yes, same thing, will modify to a readable version
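A toy version of that comprehension, with made-up token ids:

image_token_encoding = {"input_ids": [[32000, 32000, 32000]]}  # hypothetical image token ids
_text_encoding = {"input_ids": [[1, 15043, 3186]]}             # hypothetical text token ids
text_encoding = {}
for k in _text_encoding:
    text_encoding[k] = [img + txt for img, txt in zip(image_token_encoding[k], _text_encoding[k])]
# text_encoding["input_ids"] -> [[32000, 32000, 32000, 1, 15043, 3186]]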

src/transformers/models/llava/processing_llava.py
            and self.vision_feature_select_strategy is not None
        ):
            # Replace the image token with the expanded image token sequence
            height, width = get_image_size(to_numpy_array(pixel_values[0]))
molbap282 days ago

could also pixel_values.shape[2:] access image size?

zucchini-nlp282 days ago

Could work but if the channel dim is channel-last it would be shape[1:3]. I decided to use image utils instead of bloating processing with more code

molbap282 days ago

ah, that's true - it was to skip the to_numpy_array call, but got it!

src/transformers/models/llava_next/processing_llava_next.py
        num_image_tokens = unpadded_features + newline_features + base_features
        return num_image_tokens

    def _get_unpadded_features(self, height, width, patches_height, patches_width, scale_height, scale_width):
molbap282 days ago👍 1

Even if it's a private method, a docstring would help code inspectors here

src/transformers/models/llava_next_video/diff_llava_next_video.py
421 "with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. "
422 "Using processors without these attributes in the config is deprecated and will throw an error in v4.44."
423 )
424
if input_ids.shape[1] != 1:
425
if image_features is not None:
426
inputs_embeds = inputs_embeds.to(image_features.dtype)
427
(
428
inputs_embeds,
429
attention_mask,
430
position_ids,
431
labels,
432
input_ids,
433
) = self._merge_input_ids_with_image_features(
434
image_features,
435
feature_lens,
436
inputs_embeds,
437
input_ids,
438
attention_mask,
439
position_ids,
440
labels=labels,
441
image_token_index=self.config.image_token_index,
442
)
443
if video_features is not None:
444
(
445
inputs_embeds,
446
attention_mask,
447
position_ids,
448
labels,
449
input_ids,
450
) = self._merge_input_ids_with_image_features(
451
video_features,
452
video_feature_lens,
453
inputs_embeds,
454
input_ids,
455
attention_mask,
456
position_ids,
457
labels=labels,
458
image_token_index=self.config.video_token_index,
459
)
molbap282 days ago

some repeated code here - wonder if we can refactor here a bit to have only one call to _merge visible, maybe in a comprehension

zucchini-nlp282 days ago

Hmm let me see. Actually I wanted to go with one merge call for video-image models but iirc something was off with that method. I guess it was way too much logic to handle edge cases

src/transformers/models/llava_next_video/modeling_llava_next_video.py
molbap282 days ago

same comment on _merge method as in modeling_llava_next!

src/transformers/models/llava_next_video/processing_llava_next_video.py
        # videos are easier, simply get frames and multiply
        if videos_inputs:
            one_video = to_numpy_array(videos_inputs.get("pixel_values_videos")[0])
molbap282 days ago

noob question: why is it "one video" here?

zucchini-nlp282 days ago

Hehe, maybe it is a weird workaround. Similar to why I used the image utils to get the image size, I wanted it to be agnostic for video too. But video has an extra time dimension (the 2nd), so we can take one video and treat it like a batched image. get_image_size doesn't work with videos.

It's on my TODO list to make video LLMs more standard and add some uniform helper functions, but I'll probably not get to it soon...

molbap282 days ago👍 1

Alright, I see! Adding what you just said in a comment would make it clearer 😁
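For instance, a comment/sketch along these lines (dummy shapes; the exact token arithmetic differs per model):

import numpy as np
from transformers.image_utils import get_image_size, to_numpy_array

# `pixel_values_videos` has shape (batch, frames, channels, height, width). Taking one video
# gives (frames, channels, height, width), which looks like a *batch of frames*, so
# get_image_size (which has no notion of a time dimension) can be reused on a single frame.
videos_inputs = {"pixel_values_videos": np.random.rand(1, 8, 3, 224, 224)}  # dummy batch
one_video = to_numpy_array(videos_inputs.get("pixel_values_videos")[0])
height, width = get_image_size(one_video[0])
num_frames = one_video.shape[0]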

src/transformers/models/paligemma/modeling_paligemma.py
        # after this PR:
        target_length = cache_position[-1] + 1
        causal_mask = torch.full((sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device)

        # before this PR:
        if token_type_ids is not None and labels is not None:
            # we are training thus we need to create a full mask on the image + prefix but causal on suffix
            target_length = cache_position[-1] + 1
            causal_mask = torch.full(
                (sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device
            )
            if sequence_length != 1:

        # after this PR (continued):
        # do causal diagonal mask only if training, otherwise attend to the whole prefix
molbap282 days ago👀 1

I don't think that's accurate: even in training, paligemma fully attends to the prefix in a non-causal way, and is causal only for the decoded part. See
[attached: attention mask diagram]

zucchini-nlp281 days ago

Just verified again, after rebasing the latest main, that the causal masks for the previous and current versions are identical. So iiuc, the mask is full in the inference stage, which is how it works currently; the only thing that is masked is the padding token.

In training, the image you attached is exactly how it works, with the only exception that the current token can attend to itself (as it's implemented on the main branch). The image for some reason says we shouldn't attend.

Will push the changes soon after rebasing main
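A rough, self-contained sketch of the masking rule being discussed (not PaliGemma's actual code): image/prefix positions are always visible, the suffix is causal, and the current token attends to itself.

import torch

seq_len = 6
is_prefix = torch.tensor([1, 1, 1, 1, 0, 0])  # 1 = image/prefix token, 0 = suffix token
min_dtype = torch.finfo(torch.float32).min

mask = torch.triu(torch.full((seq_len, seq_len), min_dtype), diagonal=1)  # causal everywhere
mask = mask.masked_fill(is_prefix[None, :] == 1, 0.0)                     # prefix columns stay attendable
# rows are queries, columns are keys; 0.0 means "may attend", min_dtype means "masked"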

temp.py
molbap282 days ago👀 1

to delete

tests/models/blip_2/test_modeling_blip_2.py
        predictions = model.generate(**inputs, do_sample=False, max_new_tokens=15)
        generated_text = processor.batch_decode(predictions, skip_special_tokens=True)[0].strip()

        # Add args to the config to trigger new logic when inputs are expanded in processing file
        processor.num_query_tokens = model.config.num_query_tokens
        model.resize_token_embeddings(processor.tokenizer.vocab_size, pad_to_multiple_of=64)
        model.config.image_token_index = processor.tokenizer.vocab_size
molbap282 days ago

If we modify the checkpoint, this test will be moot, right? wdyt about making sure these flags are not set before testing?

zucchini-nlp282 days ago👍 1

yes, definitely, don't know why I ignored blip and added it for others

tests/models/llava/test_modeling_llava.py
        with torch.no_grad():
            model(**inputs)

    def test_inputs_embeds_matches_input_ids(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        for model_class in self.all_model_classes:
            model = model_class(config)
            model.to(torch_device)
            model.eval()

            inputs = self._prepare_for_class(inputs_dict, model_class)
            input_ids = inputs["input_ids"]
            input_ids[input_ids == model.config.image_token_index] = 1  # remove image tokens
            del inputs["input_ids"]
            del inputs["pixel_values"]

            inputs_embeds = model.get_input_embeddings()(input_ids)

            with torch.no_grad():
                out_ids = model(input_ids=input_ids, **inputs)[0]
                out_embeds = model(inputs_embeds=inputs_embeds, **inputs)[0]
            self.assertTrue(torch.allclose(out_embeds, out_ids))
molbap282 days ago

What's the incompatibility with the method from test_modeling_common with the same name? Something related to pad token ids? Could we modify the common method instead to account for the new VLM standard? Same Q for test_input_embeds below

zucchini-nlp282 days ago

Yes, it's about "pixel-values" or "pixel-values-videos" that have to be popped cause we can't pass them with inputs_embeds for VLMs. I'll check if it's generalizable enough to be added in the "common" tests, otherwise we can later override these in the VLMTesterMixin

tests/models/paligemma/test_modeling_paligemma.py
8989 "use_labels": True,
90 "image_size": 30,
91 "patch_size": 2,
90
"image_size": 20,
91
"patch_size": 5,
molbap282 days ago

Curious, why did this need to change?

zucchini-nlp281 days ago👍 1

Oh, just remembered why. The bigger the image size (and the smaller the patch size), the longer the sequence length of the image embeddings. I thought we don't need 600+ image features for tiny test models, but I can bring it back for sure

tests/models/video_llava/test_modeling_video_llava.py
    @slow
    @require_bitsandbytes
    def test_expansion_in_processing(self):
        model_id = "LanguageBind/Video-LLaVA-7B-hf"
        model = VideoLlavaForConditionalGeneration.from_pretrained(model_id, load_in_4bit=True)
        processor = VideoLlavaProcessor.from_pretrained(model_id)

        prompt = "USER: <video>Describe the video in details. ASSISTANT:"
        video_file = hf_hub_download(
            repo_id="raushan-testing-hf/videos-test", filename="video_demo.npy", repo_type="dataset"
        )
        video_file = np.load(video_file)

        # check processing with expansion of inputs
        processor.vision_feature_select_strategy = "default"
        processor.patch_size = 14
        inputs_expanded = processor(prompt, videos=video_file, return_tensors="pt").to(torch_device, torch.float16)
        self.assertTrue(inputs_expanded.input_ids.shape[-1] == 2073)

        # check processing without expansion of inputs (legacy behavior)
        processor.vision_feature_select_strategy = None
        processor.patch_size = None
        inputs = processor(prompt, videos=video_file, return_tensors="pt").to(torch_device, torch.float16)
        self.assertTrue(inputs.input_ids.shape[-1] == 18)

        # generate exactly 20 tokens
        output = model.generate(**inputs, min_new_tokens=20, max_new_tokens=20)
        output_expanded = model.generate(**inputs_expanded, min_new_tokens=20, max_new_tokens=20)

        # check that both inputs are handled correctly and generate the same output
        self.assertListEqual(output_expanded[:, -20:].tolist(), output[:, -20:].tolist())
molbap282 days ago

Really cool to have systematic testing for expansion 🚀 - prompts formats differ from model to model, but the logic is always the same, wonder if we could have a common testing method here

zucchini-nlp282 days ago🚀 1

Yes, we could have a common testing method. Adding VLMTesterMixin which expands on GenerationTesterMixin is the second thing we need after this PR is merged

zucchini-nlp Update src/transformers/models/blip_2/modeling_blip_2.py
d60624ef
zucchini-nlp remove tmp
1973b39f
zucchini-nlp merge `main`
a6e380fe
zucchini-nlp codestyle
8e88d8bb
zucchini-nlp nits
689eed93
zucchini-nlp more nits
28e80546
zucchini-nlp remove overriding in tests
637e5142
zucchini-nlp comprehension when merging video
be939d88
zucchini-nlp fix-copies
232eb7c9
zucchini-nlp revert changes for embeds test
385a6174
zucchini-nlp fix tests after making comprehension
4831a7e6
molbap
molbap commented on 2024-08-06
src/transformers/models/llava_next_video/diff_llava_next_video.py
422422 "Using processors without these attributes in the config is deprecated and will throw an error in v4.44."
423423 )
424424 if input_ids.shape[1] != 1:
425 if image_features is not None:
426 inputs_embeds = inputs_embeds.to(image_features.dtype)
427 (
428 inputs_embeds,
429 attention_mask,
430 position_ids,
431 labels,
432 input_ids,
433 ) = self._merge_input_ids_with_image_features(
434 image_features,
435 feature_lens,
436 inputs_embeds,
437 input_ids,
438 attention_mask,
439 position_ids,
440 labels=labels,
441 image_token_index=self.config.image_token_index,
442 )
443 if video_features is not None:
444 (
445 inputs_embeds,
446 attention_mask,
447 position_ids,
448 labels,
449 input_ids,
450 ) = self._merge_input_ids_with_image_features(
451 video_features,
452 video_feature_lens,
453 inputs_embeds,
454 input_ids,
455 attention_mask,
456 position_ids,
457 labels=labels,
458 image_token_index=self.config.video_token_index,
459 )
425
iterator = (
426
(image_features, feature_lens, self.config.image_token_index),
427
(video_features, video_feature_lens, self.config.video_token_index),
428
)
molbap281 days ago👍 1

that's great 🔥

zucchini-nlp requested a review from molbap 281 days ago
zucchini-nlp
zucchini-nlp281 days ago

This should be done, addressed the comments. For the failing test, I have no idea how to skip it after deprecating a property from config.

molbap
molbap281 days ago

Alright cool, taking a look soon! For the config option, a quick&dirty solution could be to do something like _ = config.ignore_index in the modeling?

molbap
molbap approved these changes on 2024-08-07
molbap280 days ago

LGTM! Some minor comments remaining but seems good

src/transformers/models/blip_2/processing_blip_2.py
        tokenizer (`AutoTokenizer`):
            An instance of ['PreTrainedTokenizer`]. The tokenizer is a required input.
        num_query_tokens (`int`, *optional*):
            MNumber of tokens used by the Qformer as queries, should be same as in model's config.
molbap280 days ago
Suggested change
MNumber of tokens used by the Qformer as queries, should be same as in model's config.
Number of tokens used by the Qformer as queries, should be same as in model's config.
src/transformers/models/blip_2/processing_blip_2.py
        tokenizer.return_token_type_ids = False
        self.current_processor = image_processor
        self.image_token = AddedToken("<image>", normalized=False, special=True)
        tokens_to_add = {"additional_special_tokens": [self.image_token]}
        tokenizer.add_special_tokens(tokens_to_add)
molbap280 days ago

Missed this the first time, I don't think we need to use additional_special_tokens here, add_tokens should be enough, no? both give the same token_id, we just lose the additional... key, which is AFAIK unused

Suggested change
tokens_to_add = {"additional_special_tokens": [self.image_token]}
tokenizer.add_special_tokens(tokens_to_add)
tokenizer.add_tokens([self.image_token], special_tokens=True)
zucchini-nlp279 days ago

yeah, seems like we don't need to access the token through additional_special_tokens, okay

src/transformers/models/paligemma/modeling_paligemma.py
                padding_mask, min_dtype
            )

        final_labels = torch.full(
            (batch_size, sequence_length), self.config.ignore_index, dtype=input_ids.dtype, device=input_ids.device
        )
        final_labels = torch.where(input_ids != self.pad_token_id, labels, final_labels)
molbap280 days ago

If the labels are not defined in the same way - i.e. not nulled where padding tokens are - it'll break BC for existing FT scripts, right?

zucchini-nlp279 days ago

Hmm, do we not expect users to ignore padding while preparing the labels? We can bring this back for BC but afaik the general rule is that LLMs don't mask out pad tokens in labels

zucchini-nlp279 days ago

brought back the masking, and added a warning that users should mask the labels themselves

molbap278 days ago

Thanks! During training padding for uneven batches is definitely masked in labels, iiuc
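For reference, "mask the labels yourself" here means the standard -100 convention, e.g.:

import torch

pad_token_id = 32001                                       # hypothetical pad id
input_ids = torch.tensor([[32001, 32001, 1, 319, 9750]])   # left-padded example
labels = input_ids.clone()
labels[labels == pad_token_id] = -100                      # -100 positions are ignored by the cross-entropy loss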

src/transformers/models/paligemma/modeling_paligemma.py
        # If we're in cached decoding stage, pixel values should be None because input ids do not contain special image token anymore
        # Otherwise we need pixel values to be passed to model. NOTE: use_cache=False needs pixel_values always
        if cache_position[0] == 0:
            model_inputs["pixel_values"] = pixel_values
molbap280 days ago

👍

src/transformers/models/paligemma/modeling_paligemma.py
        use_cache=True,
        **kwargs,
    ):
        # removed in this PR:
        # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens
        # Exception 1: when passing input_embeds, input_ids may be missing entries
        # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here
        if past_key_values is not None:
            if inputs_embeds is not None:  # Exception 1
                input_ids = input_ids[:, -cache_position.shape[0] :]
            elif input_ids.shape[1] != cache_position.shape[0]:  # Default case (the "else", a no op, is Exception 2)
                input_ids = input_ids[:, cache_position]

        if attention_mask is not None and position_ids is None:
            # create position_ids on the fly for batch generation
            position_ids = attention_mask.long().cumsum(-1) - 1
            position_ids.masked_fill_(attention_mask == 0, 1)
            if past_key_values:
                position_ids = position_ids[:, -input_ids.shape[1] :]

        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
        if inputs_embeds is not None and cache_position[0] == 0:
            model_inputs = {"inputs_embeds": inputs_embeds}
        else:
            model_inputs = {"input_ids": input_ids.contiguous()}  # `contiguous()` needed for compilation use cases

        model_inputs.update(
            {
                "position_ids": position_ids,
                "past_key_values": past_key_values,
                "cache_position": cache_position,
                "use_cache": use_cache,
                "attention_mask": attention_mask,
                "pixel_values": pixel_values,
                "token_type_ids": token_type_ids,
            }
        # added in this PR:
        model_inputs = self.language_model.prepare_inputs_for_generation(
            input_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            attention_mask=attention_mask,
            cache_position=cache_position,
            **kwargs,
molbap280 days ago

much tidier, thanks!

tests/models/llava/test_modeling_llava.py
    # overwrite inputs_embeds tests because we need to delete "pixel values" for LVLMs
    # while some other models require pixel_values to be present
    def test_inputs_embeds_matches_input_ids(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        for model_class in self.all_model_classes:
            model = model_class(config)
            model.to(torch_device)
            model.eval()

            inputs = self._prepare_for_class(inputs_dict, model_class)
            input_ids = inputs["input_ids"]
            del inputs["input_ids"]
            del inputs["pixel_values"]

            inputs_embeds = model.get_input_embeddings()(input_ids)

            with torch.no_grad():
                out_ids = model(input_ids=input_ids, **inputs)[0]
                out_embeds = model(inputs_embeds=inputs_embeds, **inputs)[0]
            self.assertTrue(torch.allclose(out_embeds, out_ids))
molbap280 days ago

Interesting, same remark, would be worth having in common or an option that checks needed inputs for a given model to do this del on-demand? (nit)

zucchini-nlp279 days ago (edited 279 days ago)👍 1

Yeah, I tried, but it seems like some models require both ids and pixels, while others require only one. Will have to think about unifying Vision2Seq model tests somehow, in the scope of another PR.

Shouldn't be a big problem to repeat the code in tests

tests/models/paligemma/test_modeling_paligemma.py
            del inputs["input_ids"]
            del inputs["pixel_values"]

            wte = model.get_input_embeddings()
molbap280 days ago👀 1

same remark for wte and the transformers version warning, to modify before merge!

zucchini-nlp Update src/transformers/models/blip_2/processing_blip_2.py
85fbff9f
zucchini-nlp Update src/transformers/models/blip_2/processing_blip_2.py
119178fc
zucchini-nlp more updates
24519118
zucchini-nlp fix tests
414031e0
zucchini-nlp requested a review from amyeroberts 279 days ago
amyeroberts
amyeroberts approved these changes on 2024-08-08
amyeroberts279 days ago

Looks great - thanks for handling all of this!

src/transformers/models/blip_2/configuration_blip_2.py
        qformer_config=None,
        text_config=None,
        num_query_tokens=32,
        image_token_index=None,
amyeroberts279 days ago

index or token_id? Index would indicate a specific location, but the logic in 1776 looks like it's matching token_ids

zucchini-nlp278 days ago

It's the token index, same as in all llava models

src/transformers/models/blip_2/processing_blip_2.py
        )

        # if we know how many query tokens, expand text inside processor. We need this hacky manipulation
        # because BLIP expects image tokens to be at the beginning even before BOS token
amyeroberts279 days ago👍 1

Nice comment :)

zucchini-nlp
zucchini-nlp278 days ago❤ 1

I'll run the slow tests and check everything is okay; will merge sometime next week

zucchini-nlp Merge remote-tracking branch 'upstream/main' into vlm_processors
8cfad207
zucchini-nlp merged a29eabd0 into main 274 days ago
pspdada
pspdada204 days ago (edited 204 days ago)

I encountered the warning: "Expanding inputs for image tokens in LLaVa should be done in processing. Please add patch_size and vision_feature_select_strategy to the model's processing config or set directly with processor.patch_size = {{patch_size}} and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}. Using processors without these attributes in the config is deprecated and will throw an error in v4.47."
Could you please guide me on how to choose the patch_size and vision_feature_select_strategy based on the model? Are there any related documents available? @zucchini-nlp

zucchini-nlp
zucchini-nlp204 days ago

@pspdada hey! The patch_size and vision_feature_select_strategy should be taken from the model config, i.e. vision_feature_select_strategy = model.config.vision_feature_select_strategy and patch_size = model.config.vision_config.patch_size

The message mentions deprecation in v4.47, but we ran into a few things that still need to be done, so the target version will be moved a few more versions up. After the issues we encountered are fixed, I'll update the official checkpoints on the Hub. Until then you are welcome to use the old way (and ignore the warning) or try out the new logic by setting the necessary params on the processor :)
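Concretely, that amounts to something like this (sketch, using the llava-1.5-7b checkpoint as an example):

from transformers import AutoProcessor, LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

processor.patch_size = model.config.vision_config.patch_size                             # 14 for CLIP ViT-L/14
processor.vision_feature_select_strategy = model.config.vision_feature_select_strategy   # "default"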

pspdada
pspdada204 days ago (edited 204 days ago)

vision_feature_select_strategy = model.config.vision_feature_select_strategy and patch_size = model.config.vision_config.patch_size

Thank you for your response, but there is an issue.
I use transformers 4.46.0.dev0, installed with pip install --upgrade git+https://github.com/huggingface/transformers.git, which means this pull request has already taken effect (because it has been merged into the main branch).

When using the following code to load the llava 1.5 model and generate with it:

def _create_v_1_5(self) -> tuple[LlavaForConditionalGeneration, LlavaProcessor]:
    model_name = f"llava-hf/llava-1.5-{self.model_size}-hf"

    model: LlavaForConditionalGeneration = LlavaForConditionalGeneration.from_pretrained(
        model_name,
        cache_dir=self.model_dir,
        torch_dtype=self.torch_dtype,
        device_map="auto",
        low_cpu_mem_usage=True,
        attn_implementation=attn_implementation,
    ).to(self.device).eval()

    processor: LlavaProcessor = LlavaProcessor.from_pretrained(
        model_name,
        cache_dir=self.model_dir,
        padding_side="left",
        vision_feature_select_strategy=model.config.vision_feature_select_strategy,
        patch_size=model.config.vision_config.patch_size,
    )
    print(model.config.vision_feature_select_strategy, model.config.vision_config.patch_size)

    return model, processor

def _gen(
    self,
    images: list[Image.Image],
    prompts: list[str],
    max_token: int,
    do_sample: list,
    temp: float,
    eos_token_id: list[int] | None = None,
) -> list:
    with torch.inference_mode():
        inputs = self.processor(
            images=images,
            text=prompts,
            return_tensors="pt",
            return_token_type_ids=False,
            padding=True,
        ).to(self.device, torch.float16)

        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=max_token,
            temperature=temp if do_sample else None,
            do_sample=do_sample,
            use_cache=True,
            eos_token_id=eos_token_id,
        )
        decoded_outputs: list[str] = self.processor.batch_decode(
            generated_ids,
            skip_special_tokens=False,
            clean_up_tokenization_spaces=False,
        )

    return decoded_outputs

The output is: default 14, and an error occurs:

File "/root/llm-project/LVLM/model/llava.py", line 202, in _gen
    generated_ids = self.model.generate(
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/transformers/generation/utils.py", line 2220, in generate
    result = self._sample(
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/transformers/generation/utils.py", line 3211, in _sample
    outputs = self(**model_inputs, return_dict=True)
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/transformers/models/llava/modeling_llava.py", line 524, in forward
    raise ValueError(
ValueError: Image features and image tokens do not match: tokens: 325440, features 576

If I remove those two lines, everything works fine.

vision_feature_select_strategy=model.config.vision_feature_select_strategy,
patch_size=model.config.vision_config.patch_size,

Is this phenomenon under control, or should I open a new issue about it and provide full information? @zucchini-nlp

pspdada
pspdada204 days ago (edited 204 days ago)

I use the following code to decode the output from model.generate:

new_sentence = decoded_output.strip('<pad>').strip('<s>').strip()[len(prompt) + 1:].strip('</s>').strip()

An interesting phenomenon is that the original output was:

The image features a man and a woman standing close to each other, posing for a picture.

However, after adding these two lines:

vision_feature_select_strategy=model.config.vision_feature_select_strategy,
patch_size=model.config.vision_config.patch_size,

the output becomes particularly strange:

...ge><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image> 
Please describe this image in detail and tell me what you see. ASSISTANT: The image features a man and a woman standing close to each other, posing for a picture.
zucchini-nlp
zucchini-nlp203 days ago

@pspdada I just opened a PR for that (#34332). It is a totally unrelated issue and should be solved soon

pspdada
pspdada203 days ago❤ 1

@pspdada I just opened a PR for that (#34332). It is a totally unrelated issue and should be solved soon

Thank you for your attention, and I wish for a good outcome.

pspdada
pspdada186 days ago (edited 186 days ago)

@pspdada I just opened a PR for that (#34332). It is a totally unrelated issue and should be solved soon

I've discovered a new issue after the merge of #34332. In the latest version of transformers==4.46.2, problems still occur when I set:

vision_feature_select_strategy=model.config.vision_feature_select_strategy,
patch_size=model.config.vision_config.patch_size,

After setting these parameters, the result of batch inference (without truncation, directly using batch_decode output) is as follows:

['<s> USER: <image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><
image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image> \nPlease describe this image in detail and tell me what you see. ASSISTANT: The image features a man and a woman standing close to each other, posing for a picture. The man is wearing a tie, and the woman is wearing a white shirt. They are both smiling and enjoying the moment.\n\nIn the background, there is a TV mounted on the wall, and a few bottles can be seen placed around the room. There are also two other people in the scene, one on the left side and another on the right side of the image.</s>', 
'<pad><pad><pad><pad><pad><pad><pad><s> USER: <image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><i
mage><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image> \nWhat is in the image? ASSISTANT: Theo, the image\n\nWhat\n\n</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>']

However, when I remove these two lines, everything works as expected, and the result is:

['<s> USER: <image> \nPlease describe this image in detail and tell me what you see. ASSISTANT: The image features a man and a woman standing close to each other, posing for a picture. The man is wearing a tie, and the woman is wearing a white shirt. They are both smiling and enjoying the moment.\n\nIn the background, there is a TV mounted on the wall, and a few bottles can be seen placed around the room. There are also two other people in the scene, one on the left side and another on the right side of the image.</s>',
 '<pad><pad><pad><pad><pad><pad><pad><s> USER: <image> \nWhat is in the image? ASSISTANT: The image features a group of birds perched on a tree branch.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>']

Shall I open a new issue about it and provide full information? @zucchini-nlp

zucchini-nlp
zucchini-nlp180 days ago

@pspdada that is the expected logic for the new processing code, as we expand the text with as many image tokens as there will be image embeddings. When decoding the text you should pass skip_special_tokens=True to remove all the image tokens
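Using the names from the snippet above, that is:

decoded_outputs = self.processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,  # drops the repeated <image> tokens and <pad> from the decoded text
    clean_up_tokenization_spaces=False,
)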

pspdada
pspdada180 days ago (edited 180 days ago)👀 1

@pspdada that is the expected logic for the new processing code, as we expand the text with as many "image" tokens as there will be image embeddings. When decoding the text you should pass skip_special_tokens=False to remove all image tokens

@zucchini-nlp But the output of the model doesn't seem quite right this way. Look at my previous example; if we remove the <image> and <pad> tokens, what remains is:

 \nWhat is in the image? ASSISTANT: Theo, the image\n\nWhat\n\n</s>

If I don't set the two lines, it will be:

\nWhat is in the image? ASSISTANT: The image features a group of birds perched on a tree branch.</s>
zucchini-nlp
zucchini-nlp180 days ago👀 2

@pspdada so the new logic generates gibberish when the inputs are batched? I don't see such behavior with the latest release. Can you verify that the padding side is set to left, as recommended in the docs (processor.tokenizer.padding_side = "left")?
