[`BC`] Update `get_(text|image|audio|video)_features` methods to return `BaseModelOutputWithPooling` (#42564)
* Add return_dict to get_text_features methods to allow returning 'BaseModelOutputWithPooling'
Added to all architectures except blip-2, which has a very different structure here: it uses 'Blip2TextModelWithProjection' to get these embeddings/features, but that class isn't as simple to use
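A minimal sketch of the usage this step enables, assuming a CLIP-style checkpoint (the checkpoint name is illustrative):
```python
from transformers import AutoModel, AutoTokenizer
from transformers.modeling_outputs import BaseModelOutputWithPooling

model = AutoModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
inputs = tokenizer(["a photo of a cat"], return_tensors="pt")

# Opt in to the structured output instead of the bare features tensor
outputs = model.get_text_features(**inputs, return_dict=True)
assert isinstance(outputs, BaseModelOutputWithPooling)
print(outputs.last_hidden_state.shape)  # (batch_size, seq_len, hidden_size)
```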
* Add return_dict to get_image_features methods to allow returning 'BaseModelOutputWithPooling'
The architectures supporting get_image_features are all extremely different, with wildly different outputs from their get_image_features methods: 2d outputs, 3d outputs, lists of 2d outputs (due to non-matching shapes), an existing 'return_attentions' resulting in a 2-tuple, an existing 'return_dict' resulting in 3-tuples (???), high quality image embeddings, low quality image embeddings, deepstack image embeddings, etc. etc. etc.
I only got through roughly 70-80% of all architectures with get_image_features before giving up.
Standardisation of all of these sounds like a lost cause.
* make fixup
* Ignore discrepancies for pooler_output, focus on last_hidden_state
* Update get_image_features for the missing architectures
* Update all get_audio_features
* Update get_video_features, except instructblipvideo
Should be fine though, as that 'get_video_features' doesn't live on the AutoModel class, but on the AutoModelForConditionalGeneration class
* Run ruff formatting
* Patch Glm4v VisionModel forward with BaseModelOutputWithPooling
* Patch instructblip, although backwards incompatibility stands
* Patch Kosmos2 and Ovis2
* Reformat Ovis2
* Avoid now-deprecated return_attentions
* Remove NumFrames
* Proposal to simplify get_..._features via TransformersKwargs & check_model_inputs
The changes in check_model_inputs aren't the clearest/prettiest, but they work well for now.
* Revert check_model_inputs, adopt can_return_tuple, accept the BC break on get_..._features methods
This commit updates all get_text_features methods, including blip_2, which had not been attempted before
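Roughly, the pattern being adopted looks like this (a hedged sketch; the class and method body are illustrative, not the exact diff; import path per recent transformers versions):
```python
from transformers import PreTrainedModel
from transformers.modeling_outputs import BaseModelOutputWithPooling
from transformers.utils import can_return_tuple

class SomeTextModel(PreTrainedModel):  # hypothetical
    @can_return_tuple
    def get_text_features(self, input_ids=None, **kwargs) -> BaseModelOutputWithPooling:
        # Always build the ModelOutput; the decorator converts it to a tuple
        # when return_dict=False is passed explicitly (or set in the config).
        return self.text_model(input_ids=input_ids, return_dict=True, **kwargs)
```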
* Fix typo: can_return_dict -> can_return_tuple
* Adopt can_return_tuple for many get_image_features
A handful of outliers aren't updated yet, e.g. where there are 2+ viable ModelOutput classes, or the vq-based ones
For context, the other modeling file classes haven't been updated with the new get_..._features format, nor have the tests
* Update all get_audio_features, some edge cases handled (e.g. gemma3n)
* Update most get_video_features, some edge cases remain, e.g. instructblipvideo
* Patch Fuyu, just return BaseModelOutputWithPooling without pooler
The Fuyu architecture doesn't have an image encoder:
> Architecturally, Fuyu is a vanilla decoder-only transformer - there is no image encoder.
* Introduce ModelOutput subclass for Chameleon, patch get_image_features
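Such a subclass is essentially a small dataclass; a hedged sketch (the class name and exact field set are illustrative; image_tokens mirrors a later commit):
```python
from dataclasses import dataclass
from typing import Optional

import torch
from transformers.utils import ModelOutput

@dataclass
class ChameleonImageFeaturesOutput(ModelOutput):  # hypothetical name
    last_hidden_state: Optional[torch.FloatTensor] = None
    # Chameleon quantizes images into discrete codebook tokens
    image_tokens: Optional[torch.LongTensor] = None
```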
* Update modeling files with new output formats for get_..._features
* Update fast_vlm modeling forward from modular llava to remove image_sizes
* Update colqwen2's self.vlm.model.visual call to expect BaseModelOutput
* Replace prior return_dict with check_model_inputs on the qwen2_5_vl VisionTransformer
* Use BaseModelOutputWithProjectionAttentions for Kosmos2 to allow returning the projection attentions
* Update Emu akin to Chameleon
* Update the blip architectures with a naive fix
A better solution might be to remove the qformer etc. calls from the get_image/video_features and run those separately in the forward methods.
* Convert remaining modulars (emu3, janus), patch emu3
* Patch blip test
* Update deepseek_vl using a new BaseModelOutputWithHighResVisionEncodings
* Remove 'copied' for blip_2, instructblip and kosmos2 as they required custom changes
* Patch qwen3_vl and qwen3_vl_moe, where I used last_hidden_state instead of pooler_output
* Run repo-consistency
* Use kwargs["output_hidden_states"] = True to hardcode output_hidden_states where needed
* Update new GlmAsr get_audio_features on ForConditionalGeneration
* Run make style
* Try to add _can_record_outputs to florence2
* Override JanusVisionModel.forward to avoid bad q-former copy from Blip2
* Import missing BaseModelOutput
* Pop deprecated 'return_attentions'; setting 'return_dict' won't be useful, if I understand correctly
* Reintroduce kwargs filtering in llava etc. for safety re. image_sizes
We also don't need to incorporate code cleanup etc. in this PR; we should keep it as minimal as possible and leave these kinds of lines intact.
* Use BaseModelOutputWithPooling superclass consistently for custom get_..._features outputs
* Update Blip-2 family and its BaseModelOutputWithVisionQformerOutputs
To use both vision_outputs and qformer_outputs as keys in the BaseModelOutputWithPooling subclass, despite some duplication. A sketch follows below.
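For reference, the shape of that subclass (a hedged sketch; the exact typing in the PR may differ):
```python
from dataclasses import dataclass
from typing import Optional

from transformers.modeling_outputs import BaseModelOutputWithPooling

@dataclass
class BaseModelOutputWithVisionQformerOutputs(BaseModelOutputWithPooling):
    # Carries the full sub-module outputs despite some duplication with the
    # inherited last_hidden_state / pooler_output fields
    vision_outputs: Optional[BaseModelOutputWithPooling] = None
    qformer_outputs: Optional[BaseModelOutputWithPooling] = None
```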
* Update glm4v _can_record_outputs
* Remove check_model_inputs in granite_speech
I could also use can_return_tuple, but this might be problematic if `return_dict=False` in the config
* Run make style
* Add _can_record_outputs to Ovis2VisionModel
* Update get_text_features/get_video_features from pe_video
* Update missing case on sam3
* Update get_text_features type hints to Union[tuple, BaseModelOutputWithPooling]
Blip-2 and Clvp are the only exceptions
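In other words, signatures now generally read (sketch; the parameter list is illustrative):
```python
from typing import Union

from transformers.modeling_outputs import BaseModelOutputWithPooling

def get_text_features(self, input_ids=None, **kwargs) -> Union[tuple, BaseModelOutputWithPooling]: ...
```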
* Add _can_record_outputs to qwen2_5_omni and qwen2_5_vl
* Update get_image_features and get_video_features on ernie4_5_vl_moe
Can we even use BaseModelOutputWithPooling for these? It's a MoE model
* Update get_image_features type hints to Union[tuple, BaseModelOutputWithPooling]
With a handful of exceptions
* Remove @auto_docstring from pe_video, it's seemingly not used on that arch
(or at least not well documented)
* Update get_video_features type hints to Union[tuple, BaseModelOutputWithPooling]
The only exceptions are the BaseModelOutputWithDeepstackFeatures cases
* Fix pe_video import issue
* Update forward, test, and docstring for sam3
* Update get_audio_features type hints to Union[tuple, BaseModelOutputWithPooling]
Also update BaseModelOutput to BaseModelOutputWithPooling in several places, leaving room for a potential pooled embedding to be computed by get_audio_features
* Add simple test case for get_text_features
Fails on CLIP, MetaCLIP, Siglip, Siglip2, as they use 'self.text_model = text_model.text_model', bypassing the TextModel that has `check_model_inputs` (sketched below). cc @zucchini-nlp, related to #42564
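The bypass looks roughly like this (a hedged sketch; class names are illustrative):
```python
from transformers import PreTrainedModel

class ClipLikeModel(PreTrainedModel):  # hypothetical
    def __init__(self, config):
        super().__init__(config)
        text_model = ClipLikeTextModel._from_config(config.text_config)  # hypothetical
        # text_model.forward carries @check_model_inputs, but grabbing the inner
        # transformer skips that decorated wrapper entirely:
        self.text_model = text_model.text_model
```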
* First attempt to get get_image_features under test, still 26 failures
* Resolve several test failures, progress still slow and inconsistent
* Split up get_..._features tests more; this should make it simpler to disable/customize specific parts per arch
* Fix emu3 tests, also track non-temporal ResNet in hidden_states
* Patch chameleon, emu3, ernie4_5, janus
* Skip output_attentions for FastVLM, as timm doesn't accept it
I'm not sure how to handle the output_hidden_states case, though
* Patch groupvit, instructblip, ovis2
plus style
* Patch paddleocr_vl, qwen2_5_omni, qwen2_5_vl, qwen2_vl, and skip test for perception_lm
perception_lm is still problematic with output_hidden_states, akin to fast_vlm
* Patch qwen3_omni_moe, sam family, edgetam
P.s. edgetam had an incorrect _can_record_outputs
Now, all remaining issues with get_image_features are due to 1) the CLIP family issue and 2) unclarity around the expected output_hidden_states for timm-based models
* Kill now-unused BaseModelOutputWithFeatureMaps
* Remove left-over return_dict from prior attempt
* Allow for output_hidden_states in theory, but skip impossible tests
The tests are failing as edgetam doesn't output hidden_states. It used to, but only because of a broken TimmWrapper entry in _can_record_outputs.
* Introduce tests for get_audio_features, fixed all architectures
* Introduce tests for get_video_features, only ernie4_5_vl_moe is failing
It's failing because split_sizes ends up too small, so video_embeds no longer sums to split_sizes. I'm not sure how best to tackle it.
I also removed the get_video_features from PaddleOCR_vl, as I don't think it's meant to be used with video
* Call post_init on GraniteSpeechCTCEncoder, which is a PreTrainedModel subclass
* Update llava_onevision test suite: only create video pixel_values in the new method
Instead of in the common one, as that negatively affects other tests (since there are no video tokens in the input_ids then)
* Create custom video input for ernie4_5_vl_moe
* Skip CLIP family tests; they don't support output_hidden_states/output_attentions due to bug
* Breaking: update Blip2Model.get_text_features to no longer output logits
* Satisfy test_num_layers_is_small test for align
* Test last_hidden_state against batch_size and hidden_size
19 failures, mostly from architectures that merge the first dimension with e.g. num_frames for videos, or deviate from the norm by placing hidden_size at index 1 of a 4d tensor
I don't think it's reasonable to expect these to be 'fixed'; they would require drastic changes to the architectures or somewhat arbitrary changes in the post-processing of the hidden states. See the sketch of the check below.
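The check itself is essentially this (an illustrative test-body sketch, not the exact test code):
```python
out = model.get_image_features(pixel_values, return_dict=True)
# Fails when batch_size is merged with num_frames, or when hidden_size
# isn't the last dimension:
self.assertEqual(out.last_hidden_state.shape[0], batch_size)
self.assertEqual(out.last_hidden_state.shape[-1], config.hidden_size)
```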
* Skip last_hidden_state shape tests for unusual cases
E.g. when batch_size is merged with num_frames or num_patches, or hidden_size sits at index -3 instead of index -1
* Update docstrings via auto_docstring for all get_..._features methods
Also add to e.g. aria.md to ensure that get_..._features methods are documented
* Ensure all auto_doc arguments are documented
* Remove redundant docstrings
* Also patch the new glm_image for get_image_features/output_hidden_states
* Update modular files as per check_docstring rules ...
... to avoid modular/check_docstring conflicts. Modular would propagate changes from the modular files to the modeling files, and then check_docstring would complain and update the modeling files only. This created an unstable state where one of the two scripts was always unhappy. I resolved it by manually tracking down the check_docstring issues in the modular files.
* Update glm-image dates via fix-repo
* FloatTensor -> LongTensor for image_tokens
* Add simple last_hidden_state description, fix output typing of Gemma3nAudioEncoder.forward
* Add missing `-> tuple | BaseModel...` on check_model_inputs
Using ``check_model_inputs[^\n]*\n\s*def forward\([^\)]*\):``
* Ensure forward typing with check_model_inputs is `-> tuple | BaseModel...`
Using ``check_model_inputs[^\n]*\n\s*def forward\([^\)]+\) -> (?!tuple | )``
* Undo accidental rename of Ovis2VisionAttention
* Fix incorrect type hints for blip family
* Patch get_image_features for lighton_ocr
* Explicitly use Ovis2VisionAttention in Ovis2VisionEncoderLayer in modular
* Update use of get_image_features for lighton_ocr
Forgot to run tests to verify that it worked, oops
* Rerun python utils/add_dates.py
Not sure which script removed the date... :/
* Remove tie_last_hidden_states=False from check_model_inputs on ...
forward methods that previously did not return a BaseModelOutput
* Revert accidental metaclip import change
* Add missing return_dict=True in get_..._features methods
* Add `output_hidden_states=True` in InternVL get_image_features
Only if needed
* Add missing docstring for llava_next_video get_video_features
* Quick clean-up in _video_features_prepare_config_and_inputs test helper
* model.set_attn_implementation instead of config._attn_implementation
Note: there are ~10 other places that use config._attn_implementation in this test file alone
* Add simple docstring to some helper methods re. inputs.
I don't think it's extremely useful, as it has to be somewhat generic due to the large differences between the architectures
* Explain why get_..._features test inputs are overridden
* Undo incorrect return_dict=True change in deepseek_vl_hybrid
I added return_dict to get_low_res_image_features and get_high_res_image_features calls, but these methods already set return_dict automatically
* Revert accidental metaclip import change
* Adopt **vision_outputs in instructblip, but some mess remains
* Avoid kwargs["output_hidden_states"] = True in get_..._features methods
* Update check_model_inputs to default vision args based on config
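The gist of the defaulting, as a hedged sketch of a hypothetical helper (the real decorator logic is more involved):
```python
def _default_flags_from_config(config, kwargs):
    # Hypothetical helper: fall back to the vision sub-config (when present)
    # for flags that weren't passed explicitly
    sub_config = getattr(config, "vision_config", config)
    for flag in ("output_attentions", "output_hidden_states"):
        if kwargs.get(flag) is None:
            kwargs[flag] = getattr(sub_config, flag, False)
    return kwargs
```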
* Unrelated but important: patch set_attn_implementation for Windows
Likewise for set_experts_implementation
* Revert output_hidden_states changes on InternVL
On this architecture, it seems cleaner to go the `kwargs["output_hidden_states"] = True` route, as a simple `output_hidden_states=vision_feature_layer != -1` prevents setting `output_hidden_states` to True when it is requested for downstream use. See the sketch below.
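The two routes, side by side (an illustrative sketch; the method body is not the exact InternVL code):
```python
def get_image_features(self, pixel_values, vision_feature_layer=-1, **kwargs):
    # Rejected: silently drops a user-requested output_hidden_states=True
    # whenever vision_feature_layer == -1:
    #   out = self.vision_tower(pixel_values, output_hidden_states=vision_feature_layer != -1)
    # Kept: always collect hidden states, so any requested layer stays reachable
    kwargs["output_hidden_states"] = True
    out = self.vision_tower(pixel_values, **kwargs)
    return out.hidden_states[vision_feature_layer]
```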
* Extend d9001cc (check_model_inputs); remove more vision_feature_layer defaulting
* Patch unusual bug: llava_next_video used self.vision_feature_layer
It doesn't seem like this was used elsewhere, so I can just switch it to the local variable, as in the other architectures
* Add unused use_cache to TimmWrapperModel to patch FastVLM
FastVLM now forwards this argument due to the check_model_inputs, and TimmWrapper can't use it
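i.e. an accept-and-ignore parameter (hedged sketch):
```python
def forward(self, pixel_values, use_cache=None, **kwargs):
    # use_cache is intentionally unused: timm backbones have no KV cache, but
    # decorated callers (here, FastVLM) now forward the argument unconditionally
    ...
```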
* Update check_config_attributes to allow for vision attributes
And rerun fix-repo
* Add tests for config.return_dict=False
Also: siglip had "nested" check_model_inputs: both the VisionModel and the VisionTransformer (below it) used `check_model_inputs`. This means VisionModel.forward eats the 'return_dict=True', and the `check_model_inputs` on the lower VisionTransformer.forward then uses config.return_dict=False to turn the output into a tuple (sketched below).
The siglip/clip/metaclip family is still broken due to the `text_model = text_model.text_model` bypassing the class with the `check_model_inputs`.
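A hedged sketch of the nesting problem (class names are illustrative; import path per recent transformers versions):
```python
import torch.nn as nn
from transformers import PreTrainedModel
from transformers.utils.generic import check_model_inputs

class VisionTransformerLike(nn.Module):  # hypothetical
    @check_model_inputs
    def forward(self, pixel_values, **kwargs): ...

class VisionModelLike(PreTrainedModel):  # hypothetical
    @check_model_inputs  # the outer decorator eats the caller's return_dict=True...
    def forward(self, pixel_values, **kwargs):
        # ...so the inner decorated forward only sees config.return_dict=False
        # and converts its output back into a tuple
        return self.vision_model(pixel_values, **kwargs)
```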
* Permute and quantize separately, as per the review comment
* Ditch shared custom_args for ernie4_5_vl_moe
* Move Ernie4_5_VL_MoeVisionAttention next to VisionBlock
* Add missing "attentions" from Florence2 _can_record_outputs
* Clarify kwargs.get("image_sizes") in modeling_llava
* Remove commented skip_test_image_features_output_shape in chameleon tests
* Add a migration guide under 'Library-wide changes with lesser impact'
* Parameterize get_..._features tests with return_dict (True, False, None)
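Roughly like this (a hedged sketch, not the exact test code; the mixin and attributes are illustrative):
```python
from parameterized import parameterized

class GetFeaturesTestMixin:  # hypothetical
    @parameterized.expand([(True,), (False,), (None,)])
    def test_get_image_features_return_dict(self, return_dict):
        kwargs = {} if return_dict is None else {"return_dict": return_dict}
        out = self.model.get_image_features(self.pixel_values, **kwargs)
        if return_dict is False:
            self.assertIsInstance(out, tuple)  # None falls back to config.return_dict
```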
* Add comment re. TimmWrapper _can_record_outputs
* Shrink Gemma3nAudioEncoderModelOutput with auto_docstring & superclass
* Revert "Unrelated but important: patch set_attn_implementation for Windows"
This reverts commit 092321671197d878e48b1c89edd154f47bb43a30.