[`BC`] Update `get_(text|image|audio|video)_features` methods to return `BaseModelOutputWithPooling` #42564
Add return_dict to get_text_features methods to allow returning 'Base…
4c659771
Add return_dict to get_image_features methods to allow returning 'Bas…
47c2418b
make fixup
b6d6df3b
Ignore discrepancies for pooler_output, focus on last_hidden_state
aa514197
Update get_image_features for the missing architectures
278b0686
Update all get_audio_features
3b140453
Update get_video_features, except instructblipvideo
b7e0d66d
Merge branch 'main' into feat/normalize_get_features_methods
41bcca84
Run ruff formatting
7eb89b61
Patch Glm4v VisionModel forward with BaseModelOutputWithPooling
57af63d3
Patch instructblip, although backwards incompatibility stands
7285187c
Patch Kosmos2 and Ovis2
fd7be527
Reformat Ovis2
3f183fd4
Avoid now-deprecated return_attentions
391aac93
Remove NumFrames
f8c887ff
Proposal to simplify get_..._features via TransformersKwargs & check_…
9a251ce0
Revert check_model_inputs, adopt can_return_tuple, accept BC on get_.…
858d9d42
Fix typo: can_return_dict -> can_return_tuple
2a643038
Adopt can_return_tuple for many get_image_features
fc8ee939
Update all get_audio_features, some edge cases handled (e.g. gemma3n)
00aa0f5d
Update most get_video_features, some edge case remain, e.g. instruct…
1ccbf5a3
Patch Fuyu, just return BaseModelOutputWithPooling without pooler
78fa904f
Introduce ModelOutput subclass for Chameleon, patch get_image_features
f082a8e8
Update modeling files with new output formats for get_..._features
9ddd3b43
Update fast_vlm modeling forward from modular llava to remove image_s…
006b2a54
Merge branch 'main' into feat/normalize_get_features_methods
afd5e64e
Update colqwen2 its self.vlm.model.visual call to expect BaseModelOutput
1d6639b7
Replace prior return_dict with check_model_inputs on qwen2_5_vl its V…
d52def37
Use BaseModelOutputWithProjectionAttentions for Kosmos2 to allow retu…
ff676635
Update Emu akin to Chameleon
22522c45
Update the blip architectures with a naive fix
37a53c38
Convert remaining modulars (emu3, janus), patch emu3
440914b6
Merge branch 'main' into feat/normalize_get_features_methods
b6dbddd4
Patch blip test
48353a54
Update deepseek_vl using a new BaseModelOutputWithHighResVisionEncodings
531321c8
Remove 'copied' for blip_2, instructblip and kosmos2 as they required…
70577d2b
Patch qwen3_vl and qwen3_vl_moe, where I used last_hidden_state inste…
f6f90d67
Run repo-consistency
7af0b665
tomaarsen marked this pull request as ready for review 37 days ago
Merge branch 'main' into feat/normalize_get_features_methods
8db6370b
Use kwargs["output_hidden_states"] = True to hardcode output_hidden_s…
cbe007b6
Update new GlmAsr get_audio_features on ForConditionalGeneration
7c34c6ec
Run make style
d9edd994
Try to add _can_record_outputs to florence2
763ddf69
Override JanusVisionModel.forward to avoid bad q-former copy from Blip2
84206403
Import missing BaseModelOutput
e0ea3003
Pop deprecated 'return_attentions', setting 'return_dict' won't be us…
78bd0d01
Reintroduce kwargs filtering in llava etc. for safety re. image_sizes
d348d935
Use BaseModelOutputWithPooling superclass consistently for custom get…
71ea85a2
tomaarsen changed the title from "[`draft`] Add `return_dict` to `get_(text|image|audio|video)_features` methods" to "[`draft`] Update `get_(text|image|audio|video)_features` methods to return `BaseModelOutputWithPooling`" 17 days ago
Update Blip-2 family and its BaseModelOutputWithVisionQformerOutputs
8c59e951
Merge branch 'main' into feat/normalize_get_features_methods
3fff252a
tomaarsen changed the title from "[`draft`] Update `get_(text|image|audio|video)_features` methods to return `BaseModelOutputWithPooling`" to "[`BC`] Update `get_(text|image|audio|video)_features` methods to return `BaseModelOutputWithPooling`" 17 days ago
Update glm4v _can_record_outputs
3f4c34bb
Remove check_model_inputs in granite_speech
b39b6d1c
Run make style
af0ccb10
Add _can_record_outputs to Ovis2VisionModel
f8e08d97
Update get_text_features/get_video_features from pe_video
2d747d94
Update missing case on sam3
008e15d3
Update get_text_features type hints to Union[tuple, BaseModelOutputWi…
e92efb9c
Add _can_record_inputs to qwen2_5_omni and qwen2_5_vl
b06a2d2e
Update get_image_features and get_video_features on ernie4_5_vl_moe
4a573afc
Update get_image_features type hints to Union[tuple, BaseModelOutputW…
2c677f9d
Remove @auto_docstring from pe_video, it's seemingly not used on that…
1a8d14be
Update get_video_features type hints to Union[tuple, BaseModelOutputW…
87d22d30
Fix pe_video import issue
8d5802e2
Update forward, test, and docstring for sam3
a9ff924b
Update get_audio_features type hints to Union[tuple, BaseModelOutputW…
8ad35e74
Add simple test case for get_text_features
7c99867a
First attempt to get get_image_features under test, still 26 failures
35feb85f
Resolve several test failures, progress still slow and inconsistent
a64634bd
Merge branch 'main' into feat/normalize_get_features_methods
b5b334f5
Split up get_..._features tests more, should be simpler to disable/cu…
5ad8ca52
Fix emu3 tests, also track non-temporal ResNet in hidden_states
0284715e
Patch chameleon, emu3, ernie4_5, janus
be41c044
Skip output_attentions for FastVLM, timm doesn't accept it
27430538
Patch groupvit, instructblip, ovis2
76371d8c
Patch paddleocr_vl, qwen2_5_omni, qwen2_5_vl, qwen2_vl, and skip test…
88a5804f
Patch qwen3_omni_moe, sam family, edgetam
13875af6
Kill now unused BaseModelOutputWithFeatureMaps
e480bc0e
Remove left-over return_dict from prior attempt
2bd9a49a
Allow for output_hidden_states in theory, but skip impossible tests
54550383
Introduce tests for get_audio_features, fixed all architectures
3f75c03e
Introduce tests for get_video_features, only ernie4_5_vl_moe is failing
5e7d821f
Call post_init on GraniteSpeechCTCEncoder, which was given a PreTrain…
1b8ab38b
Update llava_onevision test suite, only create video pixel_values in …
34677988
Create custom video input for ernie4_5_vl_moe
6f23bf5a
Skip CLIP family tests; they don't support output_hidden_states/outpu…
a8e5f920
Breaking: update Blip2Model.get_text_features to no longer output logits
508955e4
Satisfy test_num_layers_is_small test for align
df4d7512
Test against last_hidden_state against batch_size and hidden_size
1254b295
Skip last_hidden_state shape tests for unusual cases
c8b712f5
Update docstrings via auto_docstring for all get_..._features methods
d6f0fb91
Ensure all auto_doc arguments are documented
51638d6c
Remove redundant docstrings
af3b70fc
Merge branch 'main' into feat/normalize_get_features_methods
4d522c7f
Also patch the new glm_image for get_image_features/output_hidden_states
35640452
Update modular files as per check_docstring rules ...
f7100d3a
Update glm-image dates via fix-repo
a41491fc
FloatTensor -> LongTensor for image_tokens
de561226
Add simple last_hidden_state description, fix output typing of Gemma3…
d6fd9174
Add missing `-> tuple | BaseModel...` on check_model_inputs
7329ebc4
Ensure forward typing with check_model_inputs is `-> tuple | BaseMode…
72a9ac95
Undo accidental rename of Ovis2VisionAttention
9b670147
Fix incorrect type hints for blip family
cd881792
Merge branch 'main' into feat/normalize_get_features_methods
b58f3c53
Patch get_image_features for lighton_ocr
e7476694
Explicitly use Ovis2VisionAttention in Ovis2VisionEncoderLayer in mod…
95a55ad7
Update use of get_image_features for lighton_ocr
ef778324
Rerun python utils/add_dates.py
194a1bd7
vasqu commented on 2026-01-14
Remove tie_last_hidden_states=False from check_model_inputs from ...
0ce7bacc
Revert accidental metaclip import change
6604784b
Merge branch 'main' into feat/normalize_get_features_methods
07463442
Add missing return_dict=True in get_..._features methods
ed5c1364
Add `output_hidden_states=True` in InternVL get_image_features
3f0c7545
Add missing docstring for llava_next_video get_video_features
061527de
Quick clean-up in _video_features_prepare_config_and_inputs test helper
af776e91
model.set_attn_implementation instead of config._attn_implementation
125a49d7
Add simple docstring to some helper methods re. inputs.
71f9f768
Explain why get_..._features test inputs are overridden
c69c4c54
Undo incorrect return_dict=True change in deepseek_vl_hybrid
72891b92
Revert accidental metaclip import change
0d61f664
Adopt **vision_outputs in instructblip, but mess remains
fa32eff4
Merge branch 'main' into feat/normalize_get_features_methods
1a381aae
Avoid kwargs["output_hidden_states"] = True in get_..._features methods
a1e67675
Update check_model_inputs to default vision args based on config
d9001cc8
Unrelated but important: patch set_attn_implementation for Windows
09232167
Revert output_hidden_states changes on InternVL
e3b774e3
Extend d9001cc (check_model_inputs); remove more vision_feature_layer…
37a495c1
Patch unusual bug: llava_next_video used self.vision_feature_layer
bf9182da
Add unused use_cache to TimmWrapperModel to patch FastVLM
15c2a597
Merge branch 'main' into feat/normalize_get_features_methods
92fe9268
Update check_config_attributes to allow for vision attributes
d8604707
Add tests for config.return_dict=False
45d2c337
permute and quantize separately for the comment
5199c472
Ditch shared custom_args for ernie4_5_vl_moe
9865895b
Move Ernie4_5_VL_MoeVisionAttention next to VisionBlock
276dcaaf
Add missing "attentions" from Florence2 _can_record_outputs
c804de4e
Clarify kwargs.get("image_sizes") in modeling_llava
72a1a093
Remove commented skip_test_image_features_output_shape in chameleon t…
43ec4b38
Add a migration guide under 'Library-wide changes with lesser impact'
4515b29b
vasqu approved these changes on 2026-01-22
Parameterize get_..._features tests with return_dict (True, False, N…
cd4c0cb0
Add comment re. TimmWrapper _can_record_outputs
292ef3ab
Shrink Gemma3nAudioEncoderModelOutput with auto_docstring & superclass
355bcb41
Revert "Unrelated but important: patch set_attn_implementation for Wi…
bf0ae702
Merge branch 'main' into feat/normalize_get_features_methods
d8e786ff