transformers
a9f5b3a8 - Fix Qwen2.5-VL temporal RoPE scaling applied to still images (#45330)

Commit

13 days ago

Fix Qwen2.5-VL temporal RoPE scaling applied to still images (#45330)

get_rope_index unconditionally applies tokens_per_second temporal scaling to both images and videos. For still images (modality_type == 1), this shifts the temporal position origin to start_position * tokens_per_second instead of start_position, creating a mismatch with height/width dimensions.

Only apply temporal scaling (tokens_per_second * second_per_grid_ts) for video inputs (modality_type == 2). Still images use time_interval=1, keeping the temporal origin aligned with height and width at start_position.

Qwen3-VL inherits this fix via super().get_rope_index().

Fixes #45325

Co-authored-by: Raushan Turganbay <raushan@huggingface.co>
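The effect described in the message can be illustrated with a small toy model. This is a sketch only, not the code in transformers: the helper names `time_interval` and `temporal_origin`, the scalar `second_per_grid_t`, and the sample values (start_position = 10, tokens_per_second = 2) are assumptions chosen for illustration. It models only the behaviour the message states, namely that the temporal origin scales with the interval while height and width start at start_position.

```python
# Toy model of the behaviour described in the commit message.
# Not the transformers implementation; all names and values are illustrative.

IMAGE, VIDEO = 1, 2  # modality_type values named in the commit message


def time_interval(modality_type, tokens_per_second, second_per_grid_t):
    """Interval choice after the fix: scale only for video inputs."""
    if modality_type == VIDEO:
        return tokens_per_second * second_per_grid_t
    return 1  # still images keep the temporal axis unscaled


def temporal_origin(start_position, interval):
    """Temporal origin as the message describes it: start_position times the interval."""
    return start_position * interval


start_position = 10      # hypothetical position where the vision tokens begin
tokens_per_second = 2    # hypothetical config value
second_per_grid_t = 1.0  # hypothetical seconds per temporal grid step

# Before the fix, images were scaled like videos, so the temporal origin
# drifted away from the height/width origin (which stays at start_position).
buggy_image_origin = temporal_origin(
    start_position, tokens_per_second * second_per_grid_t
)

# After the fix, the image interval is 1 and the three origins line up.
fixed_image_origin = temporal_origin(
    start_position, time_interval(IMAGE, tokens_per_second, second_per_grid_t)
)

print(buggy_image_origin, fixed_image_origin, start_position)
```

With these sample values the pre-fix image origin differs from start_position, while the post-fix origin equals it; video inputs keep the scaled interval in both cases.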

References

#45330 - Fix Qwen2.5-VL temporal RoPE scaling applied to still images

Author

Kash6

Parents

d1cca998
