[`BC`] Update `get_(text|image|audio|video)_features` methods to return `BaseModelOutputWithPooling` (#42564)
* Add return_dict to get_text_features methods to allow returning 'BaseModelOutputWithPooling'
Added to all architectures except blip-2, which has a very different structure here: it uses 'Blip2TextModelWithProjection' to get these embeddings/features, but that class isn't as simple to use
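A minimal sketch of the usage this step enables, assuming a CLIP-style checkpoint (the checkpoint name is illustrative):
```python
from transformers import AutoModel, AutoTokenizer
from transformers.modeling_outputs import BaseModelOutputWithPooling

model = AutoModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
inputs = tokenizer(["a photo of a cat"], return_tensors="pt")

# Opt in to the structured output instead of the bare features tensor
outputs = model.get_text_features(**inputs, return_dict=True)
assert isinstance(outputs, BaseModelOutputWithPooling)
print(outputs.last_hidden_state.shape)  # (batch_size, seq_len, hidden_size)
```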
* Add return_dict to get_image_features methods to allow returning 'BaseModelOutputWithPooling'
The architectures supporting get_image_features are all extremely different, with wildly different outputs from their get_image_features methods: 2d outputs, 3d outputs, lists of 2d outputs (due to non-matching shapes), an existing 'return_attentions' resulting in a 2-tuple, an existing 'return_dict' resulting in 3-tuples (???), high quality image embeddings, low quality image embeddings, deepstack image embeddings, etc. etc. etc.
I only got through roughly 70-80% of all architectures with get_image_features before giving up.
Standardisation of all of these sounds like a lost cause.
* make fixup
* Ignore discrepancies for pooler_output, focus on last_hidden_state
* Update get_image_features for the missing architectures
* Update all get_audio_features
* Update get_video_features, except instructblipvideo
Should be fine though, as that 'get_video_features' doesn't live on the AutoModel class, but on the AutoModelForConditionalGeneration class
* Run ruff formatting
* Patch Glm4v VisionModel forward with BaseModelOutputWithPooling
* Patch instructblip, although backwards incompatibility stands
* Patch Kosmos2 and Ovis2
* Reformat Ovis2
* Avoid now-deprecated return_attentions
* Remove NumFrames
* Proposal to simplify get_..._features via TransformersKwargs & check_model_inputs
The changes in check_model_inputs aren't the clearest/prettiest, but they work well for now.
* Revert check_model_inputs, adopt can_return_tuple, accept the BC break on get_..._features methods
This commit updates all get_text_features methods, including blip_2, which had not been attempted before
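Roughly, the pattern being adopted looks like this (a hedged sketch; the class and method body are illustrative, not the exact diff; import path per recent transformers versions):
```python
from transformers import PreTrainedModel
from transformers.modeling_outputs import BaseModelOutputWithPooling
from transformers.utils import can_return_tuple

class SomeTextModel(PreTrainedModel):  # hypothetical
    @can_return_tuple
    def get_text_features(self, input_ids=None, **kwargs) -> BaseModelOutputWithPooling:
        # Always build the ModelOutput; the decorator converts it to a tuple
        # when return_dict=False is passed explicitly (or set in the config).
        return self.text_model(input_ids=input_ids, return_dict=True, **kwargs)
```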
* Fix typo: can_return_dict -> can_return_tuple
* Adopt can_return_tuple for many get_image_features
A handful of outliers aren't updated yet, e.g. where there are 2+ viable ModelOutput classes, or the vq-based ones
For context, the other modeling file classes haven't been updated with the new get_..._features format, nor have the tests
* Update all get_audio_features, some edge cases handled (e.g. gemma3n)
* Update most get_video_features, some edge cases remain, e.g. instructblipvideo
* Patch Fuyu, just return BaseModelOutputWithPooling without pooler
The Fuyu architecture doesn't have an image encoder:
> Architecturally, Fuyu is a vanilla decoder-only transformer - there is no image encoder.
* Introduce ModelOutput subclass for Chameleon, patch get_image_features
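Such a subclass is essentially a small dataclass; a hedged sketch (the class name and exact field set are illustrative; image_tokens mirrors a later commit):
```python
from dataclasses import dataclass
from typing import Optional

import torch
from transformers.utils import ModelOutput

@dataclass
class ChameleonImageFeaturesOutput(ModelOutput):  # hypothetical name
    last_hidden_state: Optional[torch.FloatTensor] = None
    # Chameleon quantizes images into discrete codebook tokens
    image_tokens: Optional[torch.LongTensor] = None
```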
* Update modeling files with new output formats for get_..._features
* Update fast_vlm modeling forward from modular llava to remove image_sizes
* Update colqwen2's self.vlm.model.visual call to expect BaseModelOutput
* Replace prior return_dict with check_model_inputs on the qwen2_5_vl VisionTransformer
* Use BaseModelOutputWithProjectionAttentions for Kosmos2 to allow returning the projection attentions
* Update Emu akin to Chameleon
* Update the blip architectures with a naive fix
A better solution might be to remove the qformer etc. calls from the get_image/video_features and run those separately in the forward methods.
* Convert remaining modulars (emu3, janus), patch emu3
* Patch blip test
* Update deepseek_vl using a new BaseModelOutputWithHighResVisionEncodings
* Remove 'copied' for blip_2, instructblip and kosmos2 as they required custom changes
* Patch qwen3_vl and qwen3_vl_moe, where I used last_hidden_state instead of pooler_output
* Run repo-consistency
* Use kwargs["output_hidden_states"] = True to hardcode output_hidden_states where needed
* Update new GlmAsr get_audio_features on ForConditionalGeneration
* Run make style
* Try to add _can_record_outputs to florence2
* Override JanusVisionModel.forward to avoid bad q-former copy from Blip2
* Import missing BaseModelOutput
* Pop deprecated 'return_attentions'; setting 'return_dict' won't be useful, if I understand correctly
* Reintroduce kwargs filtering in llava etc. for safety re. image_sizes
We also don't need to incorporate code cleanup etc. in this PR; we should keep it as minimal as possible and leave these kinds of lines intact.
* Use BaseModelOutputWithPooling superclass consistently for custom get_..._features outputs
* Update Blip-2 family and its BaseModelOutputWithVisionQformerOutputs
To use both vision_outputs and qformer_outputs as keys in the BaseModelOutputWithPooling subclass, despite some duplication. A sketch follows below.
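For reference, the shape of that subclass (a hedged sketch; the exact typing in the PR may differ):
```python
from dataclasses import dataclass
from typing import Optional

from transformers.modeling_outputs import BaseModelOutputWithPooling

@dataclass
class BaseModelOutputWithVisionQformerOutputs(BaseModelOutputWithPooling):
    # Carries the full sub-module outputs despite some duplication with the
    # inherited last_hidden_state / pooler_output fields
    vision_outputs: Optional[BaseModelOutputWithPooling] = None
    qformer_outputs: Optional[BaseModelOutputWithPooling] = None
```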
* Update glm4v _can_record_outputs
* Remove check_model_inputs in granite_speech
I could also use can_return_tuple, but this might be problematic if `return_dict=False` in the config
* Run make style
* Add _can_record_outputs to Ovis2VisionModel
* Update get_text_features/get_video_features from pe_video
* Update missing case on sam3
* Update get_text_features type hints to Union[tuple, BaseModelOutputWithPooling]
Blip-2 and Clvp are the only exceptions
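In other words, signatures now generally read (sketch; the parameter list is illustrative):
```python
from typing import Union

from transformers.modeling_outputs import BaseModelOutputWithPooling

def get_text_features(self, input_ids=None, **kwargs) -> Union[tuple, BaseModelOutputWithPooling]: ...
```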
* Add _can_record_outputs to qwen2_5_omni and qwen2_5_vl
* Update get_image_features and get_video_features on ernie4_5_vl_moe
Can we even use BaseModelOutputWithPooling for these? It's a MoE model
* Update get_image_features type hints to Union[tuple, BaseModelOutputWithPooling]
With a handful of exceptions
* Remove @auto_docstring from pe_video, it's seemingly not used on that arch
(or at least not well documented)
* Update get_video_features type hints to Union[tuple, BaseModelOutputWithPooling]
The only exceptions are the BaseModelOutputWithDeepstackFeatures cases
* Fix pe_video import issue
* Update forward, test, and docstring for sam3
* Update get_audio_features type hints to Union[tuple, BaseModelOutputWithPooling]
Also update BaseModelOutput to BaseModelOutputWithPooling in several places, leaving room for a potential pooled embedding to be computed by get_audio_features
* Add simple test case for get_text_features
Fails on CLIP, MetaCLIP, Siglip, Siglip2, as they use 'self.text_model = text_model.text_model', bypassing the TextModel that has `check_model_inputs` (sketched below). cc @zucchini-nlp, related to #42564
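The bypass looks roughly like this (a hedged sketch; class names are illustrative):
```python
from transformers import PreTrainedModel

class ClipLikeModel(PreTrainedModel):  # hypothetical
    def __init__(self, config):
        super().__init__(config)
        text_model = ClipLikeTextModel._from_config(config.text_config)  # hypothetical
        # text_model.forward carries @check_model_inputs, but grabbing the inner
        # transformer skips that decorated wrapper entirely:
        self.text_model = text_model.text_model
```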
* First attempt to get get_image_features under test, still 26 failures
* Resolve several test failures, progress still slow and inconsistent
* Split up get_..._features tests more; this should make it simpler to disable/customize specific parts per arch
* Fix emu3 tests, also track non-temporal ResNet in hidden_states
* Patch chameleon, emu3, ernie4_5, janus
* Skip output_attentions for FastVLM, as timm doesn't accept it
I'm not sure how to handle the output_hidden_states case, though
* Patch groupvit, instructblip, ovis2
plus style
* Patch paddleocr_vl, qwen2_5_omni, qwen2_5_vl, qwen2_vl, and skip test for perception_lm
perception_lm is still problematic with output_hidden_states, akin to fast_vlm
* Patch qwen3_omni_moe, sam family, edgetam
P.s. edgetam had an incorrect _can_record_outputs
Now, all remaining issues with get_image_features are due to 1) the CLIP family issue and 2) unclarity around the expected output_hidden_states for timm-based models
* Kill now-unused BaseModelOutputWithFeatureMaps
* Remove left-over return_dict from prior attempt
* Allow for output_hidden_states in theory, but skip impossible tests
The tests are failing as edgetam doesn't output hidden_states. It used to, but only because of a broken TimmWrapper entry in _can_record_outputs.
* Introduce tests for get_audio_features, fixed all architectures
* Introduce tests for get_video_features, only ernie4_5_vl_moe is failing
It's failing because split_sizes ends up too small, so video_embeds no longer sums to split_sizes. I'm not sure how best to tackle it.
I also removed the get_video_features from PaddleOCR_vl, as I don't think it's meant to be used with video
* Call post_init on GraniteSpeechCTCEncoder, which is a PreTrainedModel subclass
* Update llava_onevision test suite: only create video pixel_values in the new method
Instead of in the common one, as that negatively affects other tests (since there are no video tokens in the input_ids then)
* Create custom video input for ernie4_5_vl_moe
* Skip CLIP family tests; they don't support output_hidden_states/output_attentions due to bug
* Breaking: update Blip2Model.get_text_features to no longer output logits
* Satisfy test_num_layers_is_small test for align
* Test last_hidden_state against batch_size and hidden_size
19 failures, mostly from architectures that merge the first dimension with e.g. num_frames for videos, or deviate from the norm by placing hidden_size at index 1 of a 4d tensor
I don't think it's reasonable to expect these to be 'fixed'; they would require drastic changes to the architectures or somewhat arbitrary changes in the post-processing of the hidden states. See the sketch of the check below.
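The check itself is essentially this (an illustrative test-body sketch, not the exact test code):
```python
out = model.get_image_features(pixel_values, return_dict=True)
# Fails when batch_size is merged with num_frames, or when hidden_size
# isn't the last dimension:
self.assertEqual(out.last_hidden_state.shape[0], batch_size)
self.assertEqual(out.last_hidden_state.shape[-1], config.hidden_size)
```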
* Skip last_hidden_state shape tests for unusual cases
E.g. when batch_size is merged with num_frames or num_patches, or hidden_size sits at index -3 instead of index -1
* Update docstrings via auto_docstring for all get_..._features methods
Also add to e.g. aria.md to ensure that get_..._features methods are documented
* Ensure all auto_doc arguments are documented
* Remove redundant docstrings
* Also patch the new glm_image for get_image_features/output_hidden_states
* Update modular files as per check_docstring rules ...
... to avoid modular/check_docstring conflicts. Modular would propagate changes from the modular files to the modeling files, and then check_docstring would complain and update the modeling files only. This created an unstable state where one of the two scripts was always unhappy. I resolved it by manually tracking down the check_docstring issues in the modular files.
* Update glm-image dates via fix-repo
* FloatTensor -> LongTensor for image_tokens
* Add simple last_hidden_state description, fix output typing of Gemma3nAudioEncoder.forward
* Add missing `-> tuple | BaseModel...` on check_model_inputs
Using ``check_model_inputs[^\n]*\n\s*def forward\([^\)]*\):``
* Ensure forward typing with check_model_inputs is `-> tuple | BaseModel...`
Using ``check_model_inputs[^\n]*\n\s*def forward\([^\)]+\) -> (?!tuple | )``
* Undo accidental rename of Ovis2VisionAttention
* Fix incorrect type hints for blip family
* Patch get_image_features for lighton_ocr
* Explicitly use Ovis2VisionAttention in Ovis2VisionEncoderLayer in modular
* Update use of get_image_features for lighton_ocr
Forgot to run tests to verify that it worked, oops
* Rerun python utils/add_dates.py
Not sure which script removed the date... :/
* Remove tie_last_hidden_states=False from check_model_inputs on ...
forward methods that previously did not return a BaseModelOutput
* Revert accidental metaclip import change
* Add missing return_dict=True in get_..._features methods
* Add `output_hidden_states=True` in InternVL get_image_features
Only if needed
* Add missing docstring for llava_next_video get_video_features
* Quick clean-up in _video_features_prepare_config_and_inputs test helper
* model.set_attn_implementation instead of config._attn_implementation
Note: there are ~10 other places that use config._attn_implementation in this test file alone
* Add simple docstring to some helper methods re. inputs.
I don't think it's extremely useful, as it has to be somewhat generic due to the large differences between the architectures
* Explain why get_..._features test inputs are overridden
* Undo incorrect return_dict=True change in deepseek_vl_hybrid
I added return_dict to get_low_res_image_features and get_high_res_image_features calls, but these methods already set return_dict automatically
* Revert accidental metaclip import change
* Adopt **vision_outputs in instructblip, but some mess remains
* Avoid kwargs["output_hidden_states"] = True in get_..._features methods
* Update check_model_inputs to default vision args based on config
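The gist of the defaulting, as a hedged sketch of a hypothetical helper (the real decorator logic is more involved):
```python
def _default_flags_from_config(config, kwargs):
    # Hypothetical helper: fall back to the vision sub-config (when present)
    # for flags that weren't passed explicitly
    sub_config = getattr(config, "vision_config", config)
    for flag in ("output_attentions", "output_hidden_states"):
        if kwargs.get(flag) is None:
            kwargs[flag] = getattr(sub_config, flag, False)
    return kwargs
```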
* Unrelated but important: patch set_attn_implementation for Windows
Likewise for set_experts_implementation
* Revert output_hidden_states changes on InternVL
On this architecture, it seems cleaner to go the `kwargs["output_hidden_states"] = True` route, as a simple `output_hidden_states=vision_feature_layer != -1` prevents setting `output_hidden_states` to True when it is requested for downstream use. See the sketch below.
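The two routes, side by side (an illustrative sketch; the method body is not the exact InternVL code):
```python
def get_image_features(self, pixel_values, vision_feature_layer=-1, **kwargs):
    # Rejected: silently drops a user-requested output_hidden_states=True
    # whenever vision_feature_layer == -1:
    #   out = self.vision_tower(pixel_values, output_hidden_states=vision_feature_layer != -1)
    # Kept: always collect hidden states, so any requested layer stays reachable
    kwargs["output_hidden_states"] = True
    out = self.vision_tower(pixel_values, **kwargs)
    return out.hidden_states[vision_feature_layer]
```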
* Extend d9001cc (check_model_inputs); remove more vision_feature_layer defaulting
* Patch unusual bug: llava_next_video used self.vision_feature_layer
It doesn't seem like this was used elsewhere, so I can just switch it to the local variable, as in the other architectures
* Add unused use_cache to TimmWrapperModel to patch FastVLM
FastVLM now forwards this argument due to the check_model_inputs, and TimmWrapper can't use it
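i.e. an accept-and-ignore parameter (hedged sketch):
```python
def forward(self, pixel_values, use_cache=None, **kwargs):
    # use_cache is intentionally unused: timm backbones have no KV cache, but
    # decorated callers (here, FastVLM) now forward the argument unconditionally
    ...
```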
* Update check_config_attributes to allow for vision attributes
And rerun fix-repo
* Add tests for config.return_dict=False
Also: siglip had "nested" check_model_inputs: both the VisionModel and the VisionTransformer (below it) used `check_model_inputs`. This means VisionModel.forward eats the 'return_dict=True', and the `check_model_inputs` on the lower VisionTransformer.forward then uses config.return_dict=False to turn the output into a tuple (sketched below).
The siglip/clip/metaclip family is still broken due to the `text_model = text_model.text_model` bypassing the class with the `check_model_inputs`.
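A hedged sketch of the nesting problem (class names are illustrative; import path per recent transformers versions):
```python
import torch.nn as nn
from transformers import PreTrainedModel
from transformers.utils.generic import check_model_inputs

class VisionTransformerLike(nn.Module):  # hypothetical
    @check_model_inputs
    def forward(self, pixel_values, **kwargs): ...

class VisionModelLike(PreTrainedModel):  # hypothetical
    @check_model_inputs  # the outer decorator eats the caller's return_dict=True...
    def forward(self, pixel_values, **kwargs):
        # ...so the inner decorated forward only sees config.return_dict=False
        # and converts its output back into a tuple
        return self.vision_model(pixel_values, **kwargs)
```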
* Permute and quantize separately, as per the review comment
* Ditch shared custom_args for ernie4_5_vl_moe
* Move Ernie4_5_VL_MoeVisionAttention next to VisionBlock
* Add missing "attentions" from Florence2 _can_record_outputs
* Clarify kwargs.get("image_sizes") in modeling_llava
* Remove commented skip_test_image_features_output_shape in chameleon tests
* Add a migration guide under 'Library-wide changes with lesser impact'
* Parameterize get_..._features tests with return_dict (True, False, None)
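Roughly like this (a hedged sketch, not the exact test code; the mixin and attributes are illustrative):
```python
from parameterized import parameterized

class GetFeaturesTestMixin:  # hypothetical
    @parameterized.expand([(True,), (False,), (None,)])
    def test_get_image_features_return_dict(self, return_dict):
        kwargs = {} if return_dict is None else {"return_dict": return_dict}
        out = self.model.get_image_features(self.pixel_values, **kwargs)
        if return_dict is False:
            self.assertIsInstance(out, tuple)  # None falls back to config.return_dict
```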
* Add comment re. TimmWrapper _can_record_outputs
* Shrink Gemma3nAudioEncoderModelOutput with auto_docstring & superclass
* Revert "Unrelated but important: patch set_attn_implementation for Windows"
This reverts commit 092321671197d878e48b1c89edd154f47bb43a30.