PR #42564 [`BC`] Update `get_(text|image|audio|video)_features` methods to return `BaseModelOutputWithPooling`

Add return_dict to get_text_features methods to allow returning 'Base…

4c659771

Add return_dict to get_image_features methods to allow returning 'Bas…

47c2418b

make fixup

b6d6df3b

zucchini-nlp commented on 2025-12-03

Ignore discrepancies for pooler_output, focus on last_hidden_state

aa514197

Update get_image_features for the missing architectures

278b0686

Update all get_audio_features

3b140453

Update get_video_features, except instructblipvideo

b7e0d66d

Merge branch 'main' into feat/normalize_get_features_methods

41bcca84

Run ruff formatting

7eb89b61

Patch Glm4v VisionModel forward with BaseModelOutputWithPooling

57af63d3

Patch instructblip, although backwards incompatibility stands

7285187c

Patch Kosmos2 and Ovis2

fd7be527

Reformat Ovis2

3f183fd4

Avoid now-deprecated return_attentions

391aac93

zucchini-nlp commented on 2025-12-15

Remove NumFrames

f8c887ff

Proposal to simplify get_..._features via TransformersKwargs & check_…

9a251ce0

Revert check_model_inputs, adopt can_return_tuple, accept BC on get_.…

858d9d42

Fix typo: can_return_dict -> can_return_tuple
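
For readers unfamiliar with the pattern named in the commits above, here is a minimal sketch of how a `can_return_tuple`-style decorator might behave. It is a simplified stand-in, not the library's implementation: `PooledOutput`, `returns_tuple_on_request`, and the toy `get_text_features` are hypothetical names. It only honors an explicit `return_dict=False` argument and ignores any config-level default.

```python
from dataclasses import dataclass, fields
from functools import wraps
from typing import Any, Optional


@dataclass
class PooledOutput:
    """Hypothetical stand-in for a ModelOutput-style container."""

    last_hidden_state: Any = None
    pooler_output: Any = None

    def to_tuple(self) -> tuple:
        # Keep declaration order and drop fields that were never set.
        values = (getattr(self, f.name) for f in fields(self))
        return tuple(v for v in values if v is not None)


def returns_tuple_on_request(func):
    """Simplified illustration of a can_return_tuple-style decorator."""

    @wraps(func)
    def wrapper(*args, return_dict: Optional[bool] = None, **kwargs):
        output = func(*args, **kwargs)
        # Convert only when the caller explicitly opts out of the structured output.
        if return_dict is False:
            return output.to_tuple()
        return output

    return wrapper


@returns_tuple_on_request
def get_text_features(token_ids):
    # Toy encoder: one fake 2-dimensional vector per token.
    hidden = [[float(t)] * 2 for t in token_ids]
    # Pretend the first token's vector is the pooled representation.
    return PooledOutput(last_hidden_state=hidden, pooler_output=hidden[0])
```

Calling `get_text_features([1, 2, 3])` yields the structured object, while `get_text_features([1, 2, 3], return_dict=False)` yields a plain tuple of the fields that were set.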

2a643038

Adopt can_return_tuple for many get_image_features

fc8ee939

Update all get_audio_features, some edge cases handled (e.g. gemma3n)

00aa0f5d

Update most get_video_features, some edge cases remain, e.g. instruct…

1ccbf5a3

Patch Fuyu, just return BaseModelOutputWithPooling without pooler

78fa904f

Introduce ModelOutput subclass for Chameleon, patch get_image_features

f082a8e8

Update modeling files with new output formats for get_..._features

9ddd3b43

Update fast_vlm modeling forward from modular llava to remove image_s…

006b2a54

Merge branch 'main' into feat/normalize_get_features_methods

afd5e64e

Update colqwen2 its self.vlm.model.visual call to expect BaseModelOutput

1d6639b7

Replace prior return_dict with check_model_inputs on qwen2_5_vl its V…

d52def37

Use BaseModelOutputWithProjectionAttentions for Kosmos2 to allow retu…

ff676635

Update Emu akin to Chameleon

22522c45

Update the blip architectures with a naive fix

37a53c38

Convert remaining modulars (emu3, janus), patch emu3

440914b6

Merge branch 'main' into feat/normalize_get_features_methods

b6dbddd4

Patch blip test

48353a54

Update deepseek_vl using a new BaseModelOutputWithHighResVisionEncodings

531321c8

Remove 'copied' for blip_2, instructblip and kosmos2 as they required…

70577d2b

Patch qwen3_vl and qwen3_vl_moe, where I used last_hidden_state inste…

f6f90d67

Run repo-consistency

7af0b665

tomaarsen marked this pull request as ready for review 37 days ago

tomaarsen requested a review from zucchini-nlp 37 days ago

tomaarsen commented on 2025-12-22

zucchini-nlp commented on 2025-12-22

Merge branch 'main' into feat/normalize_get_features_methods

8db6370b

Use kwargs["output_hidden_states"] = True to hardcode output_hidden_s…

cbe007b6

Update new GlmAsr get_audio_features on ForConditionalGeneration

7c34c6ec

Run make style

d9edd994

Try to add _can_record_outputs to florence2

763ddf69

Override JanusVisionModel.forward to avoid bad q-former copy from Blip2

84206403

Import missing BaseModelOutput

e0ea3003

Pop deprecated 'return_attentions', setting 'return_dict' won't be us…

78bd0d01

Reintroduce kwargs filtering in llava etc. for safety re. image_sizes

d348d935

Use BaseModelOutputWithPooling superclass consistently for custom get…

71ea85a2

Update Blip-2 family and its BaseModelOutputWithVisionQformerOutputs

8c59e951

Merge branch 'main' into feat/normalize_get_features_methods

3fff252a

Update glm4v _can_record_outputs

3f4c34bb

Remove check_model_inputs in granite_speech

b39b6d1c

Run make style

af0ccb10

Add _can_record_outputs to Ovis2VisionModel

f8e08d97

Update get_text_features/get_video_features from pe_video

2d747d94

Update missing case on sam3

008e15d3

Update get_text_features type hints to Union[tuple, BaseModelOutputWi…

e92efb9c

Add _can_record_inputs to qwen2_5_omni and qwen2_5_vl

b06a2d2e

Update get_image_features and get_video_features on ernie4_5_vl_moe

4a573afc

Update get_image_features type hints to Union[tuple, BaseModelOutputW…

2c677f9d

Remove @auto_docstring from pe_video, it's seemingly not used on that…

1a8d14be

Update get_video_features type hints to Union[tuple, BaseModelOutputW…

87d22d30

Fix pe_video import issue

8d5802e2

Update forward, test, and docstring for sam3

a9ff924b

Update get_audio_features type hints to Union[tuple, BaseModelOutputW…

8ad35e74

Add simple test case for get_text_features

7c99867a

First attempt to get get_image_features under test, still 26 failures

35feb85f

Resolve several test failures, progress still slow and inconsistent

a64634bd

Merge branch 'main' into feat/normalize_get_features_methods

b5b334f5

zucchini-nlp approved these changes on 2026-01-12

Split up get_..._features tests more, should be simpler to disable/cu…

5ad8ca52

Fix emu3 tests, also track non-temporal ResNet in hidden_states

0284715e

Patch chameleon, emu3, ernie4_5, janus

be41c044

Skip output_attentions for FastVLM, timm doesn't accept it

27430538

Patch groupvit, instructblip, ovis2

76371d8c

Patch paddleocr_vl, qwen2_5_omni, qwen2_5_vl, qwen2_vl, and skip test…

88a5804f

Patch qwen3_omni_moe, sam family, edgetam

13875af6

Kill now unused BaseModelOutputWithFeatureMaps

e480bc0e

Remove left-over return_dict from prior attempt

2bd9a49a

Allow for output_hidden_states in theory, but skip impossible tests

54550383

Introduce tests for get_audio_features, fixed all architectures

3f75c03e

Introduce tests for get_video_features, only ernie4_5_vl_moe is failing

5e7d821f

Call post_init on GraniteSpeechCTCEncoder, which was given a PreTrain…

1b8ab38b

Update llava_onevision test suite, only create video pixel_values in …

34677988

Create custom video input for ernie4_5_vl_moe

6f23bf5a

Skip CLIP family tests; they don't support output_hidden_states/outpu…

a8e5f920

Breaking: update Blip2Model.get_text_features to no longer output logits

508955e4

Satisfy test_num_layers_is_small test for align

df4d7512

Test last_hidden_state against batch_size and hidden_size
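
The kind of check this commit describes could look roughly like the sketch below. It assumes the first dimension of `last_hidden_state` is the batch and the last is the hidden size. Both `nested_shape` and `check_last_hidden_state_shape` are hypothetical helpers that work on nested lists so the example runs without a tensor library. The PR's actual tests may differ.

```python
def nested_shape(x):
    """Return the shape of a rectangular nested list, e.g. (2, 5, 8)."""
    shape = []
    while isinstance(x, (list, tuple)):
        shape.append(len(x))
        x = x[0] if x else None
    return tuple(shape)


def check_last_hidden_state_shape(last_hidden_state, batch_size, hidden_size):
    """Assert that the first dim is the batch and the last dim is the hidden size."""
    shape = nested_shape(last_hidden_state)
    assert len(shape) >= 2, f"expected at least 2 dims, got {shape}"
    assert shape[0] == batch_size, f"batch dim {shape[0]} != {batch_size}"
    assert shape[-1] == hidden_size, f"hidden dim {shape[-1]} != {hidden_size}"
    return shape


# Fake output: batch of 2, sequence length 5, hidden size 8.
fake_last_hidden_state = [[[0.0] * 8 for _ in range(5)] for _ in range(2)]
checked_shape = check_last_hidden_state_shape(fake_last_hidden_state, batch_size=2, hidden_size=8)
```

Checking only the first and last dimensions leaves the middle ones (sequence length, patches, frames) free to vary by architecture.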

1254b295

Skip last_hidden_state shape tests for unusual cases

c8b712f5

Update docstrings via auto_docstring for all get_..._features methods

d6f0fb91

Ensure all auto_doc arguments are documented

51638d6c

Remove redundant docstrings

af3b70fc

Merge branch 'main' into feat/normalize_get_features_methods

4d522c7f

Also patch the new glm_image for get_image_features/output_hidden_states

35640452

Update modular files as per check_docstring rules ...

f7100d3a

Update glm-image dates via fix-repo

a41491fc

tomaarsen requested a review from ArthurZucker 9 days ago

tomaarsen requested a review from vasqu 9 days ago

zucchini-nlp commented on 2026-01-15

FloatTensor -> LongTensor for image_tokens

de561226

Add simple last_hidden_state description, fix output typing of Gemma3…

d6fd9174

Add missing `-> tuple | BaseModel...` on check_model_inputs

7329ebc4

Ensure forward typing with check_model_inputs is `-> tuple | BaseMode…

72a9ac95

Undo accidental rename of Ovis2VisionAttention

9b670147

Fix incorrect type hints for blip family

cd881792

Merge branch 'main' into feat/normalize_get_features_methods

b58f3c53

Patch get_image_features for lighton_ocr

e7476694

Explicitly use Ovis2VisionAttention in Ovis2VisionEncoderLayer in mod…

95a55ad7

Update use of get_image_features for lighton_ocr

ef778324

Rerun python utils/add_dates.py

194a1bd7

vasqu commented on 2026-01-14

tomaarsen commented on 2026-01-15

Remove tie_last_hidden_states=False from check_model_inputs from ...

0ce7bacc

Revert accidental metaclip import change

6604784b

Merge branch 'main' into feat/normalize_get_features_methods

07463442

ArthurZucker commented on 2026-01-19

Add missing return_dict=True in get_..._features methods

ed5c1364

Add `output_hidden_states=True` in InternVL get_image_features

3f0c7545

Add missing docstring for llava_next_video get_video_features

061527de

Quick clean-up in _video_features_prepare_config_and_inputs test helper

af776e91

model.set_attn_implementation instead of config._attn_implementation

125a49d7

Add simple docstring to some helper methods re. inputs.

71f9f768

Explain why get_..._features test inputs are overridden

c69c4c54

Undo incorrect return_dict=True change in deepseek_vl_hybrid

72891b92

Revert accidental metaclip import change

0d61f664

Adopt **vision_outputs in instructblip, but mess remains

fa32eff4

Merge branch 'main' into feat/normalize_get_features_methods

1a381aae

ArthurZucker approved these changes on 2026-01-22

Avoid kwargs["output_hidden_states"] = True in get_..._features methods

a1e67675

Update check_model_inputs to default vision args based on config

d9001cc8

Unrelated but important: patch set_attn_implementation for Windows

09232167

Revert output_hidden_states changes on InternVL

e3b774e3

Extend d9001cc (check_model_inputs); remove more vision_feature_layer…

37a495c1

Patch unusual bug: llava_next_video used self.vision_feature_layer

bf9182da

Add unused use_cache to TimmWrapperModel to patch FastVLM

15c2a597

Merge branch 'main' into feat/normalize_get_features_methods

92fe9268

Update check_config_attributes to allow for vision attributes

d8604707

Add tests for config.return_dict=False

45d2c337

permute and quantize separately for the comment

5199c472

Ditch shared custom_args for ernie4_5_vl_moe

9865895b

Move Ernie4_5_VL_MoeVisionAttention next to VisionBlock

276dcaaf

Add missing "attentions" from Florence2 _can_record_outputs

c804de4e

Clarify kwargs.get("image_sizes") in modeling_llava

72a1a093

Remove commented skip_test_image_features_output_shape in chameleon t…

43ec4b38

Add a migration guide under 'Library-wide changes with lesser impact'
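
For downstream code that must work on both sides of this change, one possible adapter is sketched below. It assumes, as the PR title suggests, that the value previously returned directly is now exposed as `pooler_output` on the returned object. `pooled_features` is a hypothetical helper, not part of the library. Tuple returns (`return_dict=False`) are deliberately not handled.

```python
from types import SimpleNamespace
from typing import Any


def pooled_features(result: Any) -> Any:
    """Return the pooled features from either the old or the new return style."""
    # New style: a structured output object that carries `pooler_output`.
    if hasattr(result, "pooler_output"):
        # May be None for architectures that return no pooled value.
        return result.pooler_output
    # Old style: the features were returned directly.
    return result


# Stand-ins for the two return styles; real calls would return tensors.
old_style = [0.1, 0.2, 0.3]
new_style = SimpleNamespace(last_hidden_state=[[0.1, 0.2, 0.3]], pooler_output=[0.1, 0.2, 0.3])

same = pooled_features(old_style) == pooled_features(new_style)
```

Some commit messages above (for example the Fuyu patch) mention outputs without a pooler, so callers should verify per model which field holds the features they need.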

4515b29b

vasqu approved these changes on 2026-01-22

Parameterize get_..._features tests with return_dict (True, False, N…

cd4c0cb0

Add comment re. TimmWrapper _can_record_outputs

292ef3ab

Shrink Gemma3nAudioEncoderModelOutput with auto_docstring & superclass

355bcb41

Revert "Unrelated but important: patch set_attn_implementation for Wi…

bf0ae702

Merge branch 'main' into feat/normalize_get_features_methods

d8e786ff

ArthurZucker merged 55dadb86 into main 1 day ago
