        return BatchFeature(data={**text_inputs, **image_inputs})

    def _get_number_of_features(self, height: int, width: int) -> int:
Mostly copied from TGI with minor changes in calculations for unpadding, otherwise it won't work for low resolution images
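For readers following along, here is a simplified, illustrative sketch (not the actual TGI / transformers math) of how the number of image features relates to the input resolution for a LLaVA-NeXT style model; the real `_get_number_of_features` additionally handles the unpadding and newline tokens that this comment is about:

```python
import math

# Each crop of size `image_size` contributes (image_size // patch_size) ** 2 patch
# embeddings, on top of the resized "global" image. Unpadding is deliberately ignored.
def approx_number_of_features(height: int, width: int, image_size: int = 336, patch_size: int = 14) -> int:
    patches_per_side = image_size // patch_size            # 24 for CLIP ViT-L/14-336
    base_features = patches_per_side ** 2                  # the resized base image
    grid_h = math.ceil(height / image_size)
    grid_w = math.ceil(width / image_size)
    crop_features = grid_h * grid_w * patches_per_side ** 2
    return base_features + crop_features

print(approx_number_of_features(672, 672))   # 2880 = 576 + 4 * 576
```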
Looking forward to seeing this expanded to other VLMs! Some might be trickier - PaliGemma incorporates causal mask computation in the merge method, for instance (thought about that when reading) - but it makes sense that most of this should belong in the processor, not the modeling
@amyeroberts I did some clean-up after Arthur's comments. Requesting review, should be ready. If this works I will expand the logic to BLIP and PaliGemma in the next weeks
What changed:
Thanks for working on this - will be great to have some of this logic unified!
Main comment is about how we set the required arguments for processing in the processor
@amyeroberts addressed the comments and added all VLMs to the PR (excluding Idefics, Fuyu and Kosmos as those already have expansion in processing).
Processor class (with `processor.patch_size = patch_size`)
Wow - a big piece of work!
Overall looks good to me, just a few comments here and there. I'd like to have a second review from @molbap and a run on the slow tests for all the models touched here
            special_image_mask = (input_ids == self.config.image_token_index).unsqueeze(-1).expand_as(inputs_embeds)
            inputs_embeds[special_image_mask] = language_model_inputs.flatten()
        else:
            logger.warning_once(
We should make sure the official checkpoints have been updated this way
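As a small, self-contained illustration of the masked-scatter trick in the hunk above (toy sizes and a hypothetical token id, not the actual BLIP-2 shapes): positions holding the image token id are overwritten with the flattened image embeddings, one embedding row per image token.

```python
import torch

hidden_size = 4
image_token_index = 32000
input_ids = torch.tensor([[32000, 32000, 5, 6]])
inputs_embeds = torch.zeros(1, 4, hidden_size)
language_model_inputs = torch.ones(2, hidden_size)   # one row per <image> position

special_image_mask = (input_ids == image_token_index).unsqueeze(-1).expand_as(inputs_embeds)
inputs_embeds[special_image_mask] = language_model_inputs.flatten()
print(inputs_embeds[0, :, 0])   # tensor([1., 1., 0., 0.])
```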
Will continue reviewing (biig work) but sharing feedback on the first 2 things I tested, Llava and Paligemma
         inputs_embeds = self.get_input_embeddings()(input_ids)

-        # 2. Merge text and images
-        if pixel_values is not None and input_ids.shape[1] != 1:
-            image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
-            # this is not memory efficient at all (output_hidden_states=True) will save all the hidden stated.
-            selected_image_feature = image_outputs.hidden_states[vision_feature_layer]
+        # if the number of image tokens is more than image embeddings seq length, then prob we expanded it in processing
+        # not very reliable, but we don't expect one to actually pass 500+ images for one prompt
couldn't we also have a very tiny image embeddings sequence length?
For LLaVa models the usual backbone is CLIP (I don't remember which ckpt exactly), which always gives a 576-token sequence. In case some LLaVas have another vision backbone, I guess the user has to switch the `image_seq_length` parameter in the model's config.
I'm wondering if we can catch that and give the user an informative error msg. Will explore this issue more.
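As a quick sanity check of the 576 figure mentioned above: LLaVA-1.5's CLIP ViT-L/14-336 backbone produces (336 / 14) ** 2 patch embeddings per image, which is where the default image sequence length comes from.

```python
# 336x336 input, 14x14 patches -> 24x24 grid of patch embeddings
image_size, patch_size = 336, 14
print((image_size // patch_size) ** 2)   # 576
```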
It's a biiig piece of work, nicely done, tests and all! I left a few comments on some things I didn't understand well + paligemma masking in particular
            logger.warning_once(
                "Expanding inputs for image tokens in BLIP-2 should be done in processing. "
                "Please follow instruction here (https://gist.github.com/zucchini-nlp/e9f20b054fa322f84ac9311d9ab67042) to update your BLIP-2 model. "
                "Using processors without these attributes in the config is deprecated and will throw an error in v4.44."
Maybe this, or another version number as we're in 4.44 dev version
"Using processors without these attributes in the config is deprecated and will throw an error in v4.44." | |
"Using processors without these attributes in the config is deprecated and will throw an error in a later version" |
Yes, will update accordingly when we get one step away from merging. I think two-three major versions from current one will work :)
421 | "with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. " | ||
422 | "Using processors without these attributes in the config is deprecated and will throw an error in v4.44." | ||
423 | ) | ||
424 | if input_ids.shape[1] != 1: | ||
425 | if image_features is not None: | ||
426 | inputs_embeds = inputs_embeds.to(image_features.dtype) | ||
427 | ( | ||
428 | inputs_embeds, | ||
429 | attention_mask, | ||
430 | position_ids, | ||
431 | labels, | ||
432 | input_ids, | ||
433 | ) = self._merge_input_ids_with_image_features( | ||
434 | image_features, | ||
435 | feature_lens, | ||
436 | inputs_embeds, | ||
437 | input_ids, | ||
438 | attention_mask, | ||
439 | position_ids, | ||
440 | labels=labels, | ||
441 | image_token_index=self.config.image_token_index, | ||
442 | ) | ||
443 | if video_features is not None: | ||
444 | ( | ||
445 | inputs_embeds, | ||
446 | attention_mask, | ||
447 | position_ids, | ||
448 | labels, | ||
449 | input_ids, | ||
450 | ) = self._merge_input_ids_with_image_features( | ||
451 | video_features, | ||
452 | video_feature_lens, | ||
453 | inputs_embeds, | ||
454 | input_ids, | ||
455 | attention_mask, | ||
456 | position_ids, | ||
457 | labels=labels, | ||
458 | image_token_index=self.config.video_token_index, | ||
459 | ) |
some repeated code here - wonder if we can refactor a bit to have only one call to `_merge` visible, maybe in a comprehension
Hmm let me see. Actually I wanted to go with one merge call for video-image models but iirc something was off with that method. I guess it was way too much logic to handle edge cases
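For reference, a rough sketch of the suggested refactor, using the variable names from the hunk above; this is a fragment meant to sit inside the forward method, not the actual llava-next-video code, and it glosses over the edge cases mentioned:

```python
# Iterate over the two modalities so _merge_input_ids_with_image_features is
# called in only one place.
for features, lens, token_index in (
    (image_features, feature_lens, self.config.image_token_index),
    (video_features, video_feature_lens, self.config.video_token_index),
):
    if features is None:
        continue
    inputs_embeds = inputs_embeds.to(features.dtype)
    (
        inputs_embeds,
        attention_mask,
        position_ids,
        labels,
        input_ids,
    ) = self._merge_input_ids_with_image_features(
        features,
        lens,
        inputs_embeds,
        input_ids,
        attention_mask,
        position_ids,
        labels=labels,
        image_token_index=token_index,
    )
```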
same comment on _merge method as in modeling_llava_next!
+            target_length = cache_position[-1] + 1
+            causal_mask = torch.full((sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device)

-            if token_type_ids is not None and labels is not None:
-                # we are training thus we need to create a full mask on the image + prefix but causal on suffix
-                target_length = cache_position[-1] + 1
-                causal_mask = torch.full(
-                    (sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device
-                )
-            if sequence_length != 1:
+            # do causal diagonal mask only if training, otherwise attend to the whole prefix
Just verified again, after rebasing on the latest `main`, that the causal masks for the previous and current versions are identical. So iiuc, the mask is full at inference time, which is how it works currently; the only thing that is masked is the padding tokens.
In training, the image you attached is exactly how it works, with the only exception that the current token can attend to itself (as implemented on the `main` branch). The image for some reason says we shouldn't attend.
Will push the changes soon after rebasing on `main`
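To make the masking scheme being discussed concrete, here is an illustrative sketch (not PaliGemma's actual implementation): bidirectional attention over the image + prefix tokens, causal attention over the suffix, with 0 meaning "attend" and a large negative value meaning "masked".

```python
import torch

def prefix_lm_mask(seq_len: int, prefix_len: int, dtype=torch.float32) -> torch.Tensor:
    min_val = torch.finfo(dtype).min
    mask = torch.zeros(seq_len, seq_len, dtype=dtype)
    is_future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    mask = mask.masked_fill(is_future, min_val)   # standard causal mask (self + past visible)
    mask[:, :prefix_len] = 0.0                    # every token sees the full image + prefix
    return mask

print(prefix_lm_mask(5, 3))
```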
This should be done, addressed the comments. For the failing test, I have no idea how to skip it after deprecating a property from config.
Alright cool, taking a look soon! For the config option, a quick & dirty solution could be to do something like `_ = config.ignore_index` in the modeling?
LGTM! Some minor comments remaining but seems good
                    padding_mask, min_dtype
                )

            final_labels = torch.full(
                (batch_size, sequence_length), self.config.ignore_index, dtype=input_ids.dtype, device=input_ids.device
            )
            final_labels = torch.where(input_ids != self.pad_token_id, labels, final_labels)
If the labels are not defined in the same way - i.e. not nulled where padding tokens are - it'll break BC for existing FT scripts, right?
Hmm, do we not expect users to ignore padding while preparing the labels? We can bring this back for BC but afaik the general rule is that LLMs don't mask out pad tokens in labels
Brought back the masking, and added a warning that users should mask labels themselves
Thanks! During training padding for uneven batches is definitely masked in labels, iiuc
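The convention referred to here, as a tiny self-contained example: mask padding positions in the labels yourself so they are ignored by the loss (-100 is the default ignore index of `CrossEntropyLoss`).

```python
import torch

input_ids = torch.tensor([[5, 17, 42, 0, 0]])        # toy batch, 0 = pad token id
attention_mask = torch.tensor([[1, 1, 1, 0, 0]])

labels = input_ids.clone()
labels[attention_mask == 0] = -100
print(labels)   # tensor([[   5,   17,   42, -100, -100]])
```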
    # overwrite inputs_embeds tests because we need to delete "pixel values" for LVLMs
    # while some other models require pixel_values to be present
    def test_inputs_embeds_matches_input_ids(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        for model_class in self.all_model_classes:
            model = model_class(config)
            model.to(torch_device)
            model.eval()

            inputs = self._prepare_for_class(inputs_dict, model_class)
            input_ids = inputs["input_ids"]
            del inputs["input_ids"]
            del inputs["pixel_values"]

            inputs_embeds = model.get_input_embeddings()(input_ids)

            with torch.no_grad():
                out_ids = model(input_ids=input_ids, **inputs)[0]
                out_embeds = model(inputs_embeds=inputs_embeds, **inputs)[0]
            self.assertTrue(torch.allclose(out_embeds, out_ids))
Interesting, same remark - would be worth having this in common, or an option that checks the needed inputs for a given model and does this `del` on demand? (nit)
yeah, I tried, but it seems like some models require both ids and pixels, while others require only one. Will have to think about unifying Vision2Seq model tests somehow, in the scope of another PR
Shouldn't be a big problem to repeat the code in tests
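A hedged sketch of the reviewer's idea (the attribute name `_embeds_test_ignore_keys` is hypothetical, not an existing transformers attribute): a common helper drops only the inputs a given test class declares as unneeded, instead of each model overriding the whole test.

```python
class InputsEmbedsTestMixin:
    # Models that need pixel_values alongside inputs_embeds can override this tuple.
    _embeds_test_ignore_keys = ("pixel_values",)

    def _strip_for_embeds_test(self, inputs: dict) -> dict:
        inputs = dict(inputs)
        inputs.pop("input_ids", None)
        for key in self._embeds_test_ignore_keys:
            inputs.pop(key, None)
        return inputs
```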
            del inputs["input_ids"]
            del inputs["pixel_values"]

            wte = model.get_input_embeddings()
same remark for wte and the transformers version warning, to modify before merge!
Looks great - thanks for handling all of this!
        qformer_config=None,
        text_config=None,
        num_query_tokens=32,
        image_token_index=None,
index or token_id? Index would indicate a specific location, but the logic in 1776 looks like it's matching token_ids
It's the token index, same as in all llava models
I'll run slow tests and check everything is okay, will merge sometime next week
I encountered the warning: "Expanding inputs for image tokens in LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and `processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47."
Could you please guide me on how to choose the `patch_size` and `vision_feature_select_strategy` based on the model? Are there any related documents available? @zucchini-nlp
@pspdada hey! The `patch_size` and `vision_feature_select_strategy` should be taken from the model config, i.e. `vision_feature_select_strategy = model.config.vision_feature_select_strategy` and `patch_size = model.config.vision_config.patch_size`.
The message mentions deprecation in v4.47, but we encountered a few things that still need to be done, so the target version will be moved a few more versions up. After the issues we encountered are fixed, I'll update the official checkpoints on the Hub. Until then you are welcome to use the old way (and ignore the warning), or try out the new logic by setting the necessary params in the processor :)
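A minimal sketch of this advice, using the llava-1.5-7b checkpoint purely as an example: read the two values from the model config and set them on the processor so the processor can expand the image tokens itself.

```python
from transformers import LlavaForConditionalGeneration, LlavaProcessor

model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
processor = LlavaProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Take both values from the model config, as suggested above.
processor.patch_size = model.config.vision_config.patch_size
processor.vision_feature_select_strategy = model.config.vision_feature_select_strategy
```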
vision_feature_select_strategy = model.config.vision_feature_select_strategy and patch_size = model.config.vision_config.patch_size
Thank you for your response, but there is an issue.
I use transformers 4.46.0.dev0, installed via `pip install --upgrade git+https://github.com/huggingface/transformers.git`, which means this pull request has already taken effect (it has been merged into the main branch).
When using the following code to load the llava 1.5 model and generate with it:
def _create_v_1_5(self) -> tuple[LlavaForConditionalGeneration, LlavaProcessor]:
    model_name = f"llava-hf/llava-1.5-{self.model_size}-hf"
    model: LlavaForConditionalGeneration = LlavaForConditionalGeneration.from_pretrained(
        model_name,
        cache_dir=self.model_dir,
        torch_dtype=self.torch_dtype,
        device_map="auto",
        low_cpu_mem_usage=True,
        attn_implementation=attn_implementation,
    ).to(self.device).eval()
    processor: LlavaProcessor = LlavaProcessor.from_pretrained(
        model_name,
        cache_dir=self.model_dir,
        padding_side="left",
        vision_feature_select_strategy=model.config.vision_feature_select_strategy,
        patch_size=model.config.vision_config.patch_size,
    )
    print(model.config.vision_feature_select_strategy, model.config.vision_config.patch_size)
    return model, processor

def _gen(
    self,
    images: list[Image.Image],
    prompts: list[str],
    max_token: int,
    do_sample: list,
    temp: float,
    eos_token_id: list[int] | None = None,
) -> list:
    with torch.inference_mode():
        inputs = self.processor(
            images=images,
            text=prompts,
            return_tensors="pt",
            return_token_type_ids=False,
            padding=True,
        ).to(self.device, torch.float16)
        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=max_token,
            temperature=temp if do_sample else None,
            do_sample=do_sample,
            use_cache=True,
            eos_token_id=eos_token_id,
        )
        decoded_outputs: list[str] = self.processor.batch_decode(
            generated_ids,
            skip_special_tokens=False,
            clean_up_tokenization_spaces=False,
        )
        return decoded_outputs
The output is `default 14`, and an error occurs:
File "/root/llm-project/LVLM/model/llava.py", line 202, in _gen
generated_ids = self.model.generate(
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/transformers/generation/utils.py", line 2220, in generate
result = self._sample(
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/transformers/generation/utils.py", line 3211, in _sample
outputs = self(**model_inputs, return_dict=True)
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/transformers/models/llava/modeling_llava.py", line 524, in forward
raise ValueError(
ValueError: Image features and image tokens do not match: tokens: 325440, features 576
If I remove those two lines, everything works fine:
vision_feature_select_strategy=model.config.vision_feature_select_strategy,
patch_size=model.config.vision_config.patch_size,
Is this phenomenon under control, or should I open a new issue about it and provide full information?
I use the following code to decode the output from `model.generate`:
new_sentence = decoded_output.strip('<pad>').strip('<s>').strip()[len(prompt) + 1:].strip('</s>').strip()
An interesting phenomenon is that the original output was:
The image features a man and a woman standing close to each other, posing for a picture.
However, after adding these two lines:
vision_feature_select_strategy=model.config.vision_feature_select_strategy,
patch_size=model.config.vision_config.patch_size,
the output becomes particularly strange:
...ge><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image>
Please describe this image in detail and tell me what you see. ASSISTANT: The image features a man and a woman standing close to each other, posing for a picture.
@pspdada I just opened a PR for that (#34332). It is a totally unrelated issue and should be solved soon
I've discovered a new issue after the merge of #34332. In the latest version, transformers==4.46.2, problems still occur when I set:
vision_feature_select_strategy=model.config.vision_feature_select_strategy,
patch_size=model.config.vision_config.patch_size,
After setting these parameters, the result of batch inference (without truncation, directly using the `batch_decode` output) is as follows:
['<s> USER: <image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><
image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image> \nPlease describe this image in detail and tell me what you see. ASSISTANT: The image features a man and a woman standing close to each other, posing for a picture. The man is wearing a tie, and the woman is wearing a white shirt. They are both smiling and enjoying the moment.\n\nIn the background, there is a TV mounted on the wall, and a few bottles can be seen placed around the room. There are also two other people in the scene, one on the left side and another on the right side of the image.</s>',
'<pad><pad><pad><pad><pad><pad><pad><s> USER: <image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><i
mage><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image> \nWhat is in the image? ASSISTANT: Theo, the image\n\nWhat\n\n</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>']
However, when I remove these two lines, everything works as expected, and the result is:
['<s> USER: <image> \nPlease describe this image in detail and tell me what you see. ASSISTANT: The image features a man and a woman standing close to each other, posing for a picture. The man is wearing a tie, and the woman is wearing a white shirt. They are both smiling and enjoying the moment.\n\nIn the background, there is a TV mounted on the wall, and a few bottles can be seen placed around the room. There are also two other people in the scene, one on the left side and another on the right side of the image.</s>',
'<pad><pad><pad><pad><pad><pad><pad><s> USER: <image> \nWhat is in the image? ASSISTANT: The image features a group of birds perched on a tree branch.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>']
Shall I open a new issue about it and provide full information? @zucchini-nlp
@pspdada that is the expected logic for the new processing code, as we expand the text with as many "image" tokens as there will be image embeddings. When decoding the text you should pass `skip_special_tokens=True` to remove all `image` tokens
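Concretely, reusing `processor` and `generated_ids` from the snippet earlier in the thread, the decode step would look like this, so the repeated `<image>` placeholders (and `<pad>`/`<s>`) are dropped from the returned strings:

```python
# Special tokens, including the expanded <image> placeholders, are removed here.
decoded_outputs = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
```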
@pspdada that is the expected logic for the new processing code, as we expand the text with as many "image" tokens as there will be image embeddings. When decoding the text you should pass `skip_special_tokens=True` to remove all `image` tokens
@zucchini-nlp But the output of the model doesn't seem quite right this way. Look at my previous example; if we remove the `<image>` and `<pad>` tokens, what remains is:
\nWhat is in the image? ASSISTANT: Theo, the image\n\nWhat\n\n</s>
If I don't set the two lines, it will be:
\nWhat is in the image? ASSISTANT: The image features a group of birds perched on a tree branch.</s>
What does this PR do?
Fixes #30809. This PR moves `_merge_inputs_with_vision_embeds` into the processing logic, thus making VLMs more versatile in terms of generation strategies. All models were tested locally with different batch sizes and image resolutions; the generation is the same as it was before the changes.
The main idea is to get the sequence length of the image features inside the processing files and expand the input ids by repeating the special image token. The same is already done for IDEFICS in `transformers`.
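A minimal sketch of that idea (not the exact transformers implementation): the processor replaces each "<image>" placeholder with as many image tokens as the vision tower will return embeddings, so the modeling code only has to scatter the image features into the already-expanded positions.

```python
def expand_image_tokens(text: str, num_image_tokens: int = 576, image_token: str = "<image>") -> str:
    # 576 corresponds to the CLIP ViT-L/14-336 backbone used by llava-1.5 checkpoints
    return text.replace(image_token, image_token * num_image_tokens)

prompt = "USER: <image>\nWhat is in the image? ASSISTANT:"
print(expand_image_tokens(prompt).count("<image>"))   # 576
```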