transformers
a9f5b3a8 - Fix Qwen2.5-VL temporal RoPE scaling applied to still images (#45330)

13 days ago
Fix Qwen2.5-VL temporal RoPE scaling applied to still images (#45330)

get_rope_index unconditionally applies tokens_per_second temporal scaling to both images and videos. For still images (modality_type == 1), this shifts the temporal position origin to start_position * tokens_per_second instead of start_position, creating a mismatch with the height/width dimensions.

Only apply temporal scaling (tokens_per_second * second_per_grid_ts) for video inputs (modality_type == 2). Still images use time_interval=1, keeping the temporal origin aligned with height and width at start_position.

Qwen3-VL inherits this fix via super().get_rope_index().

Fixes #45325

Co-authored-by: Raushan Turganbay <raushan@huggingface.co>
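The effect described in the message can be illustrated with a small, framework-free sketch. This is not the code from the commit: the helpers time_interval_for and temporal_positions are hypothetical, the `fixed` flag exists only to compare old and new behavior, and the formula (start_position + t) * time_interval is an assumption inferred from the message's statement that the origin became start_position * tokens_per_second. The numeric values are arbitrary examples.

```python
# Modality codes as named in the commit message.
IMAGE, VIDEO = 1, 2


def time_interval_for(modality_type, tokens_per_second, second_per_grid_ts, fixed=True):
    """Hypothetical helper: temporal step between consecutive grid frames.

    fixed=False mimics the old behavior (scaling applied to every modality);
    fixed=True applies scaling to video only, as the commit message describes.
    """
    if modality_type == VIDEO or not fixed:
        return tokens_per_second * second_per_grid_ts
    return 1  # still images keep a unit interval


def temporal_positions(start_position, grid_t, time_interval):
    """Hypothetical helper: temporal index per frame.

    ASSUMPTION: the index is (start_position + t) * time_interval, which
    reproduces the origin shift described in the commit message.
    """
    return [int((start_position + t) * time_interval) for t in range(grid_t)]


if __name__ == "__main__":
    start, tps, spg = 5, 2, 1.0  # arbitrary example values

    # Old behavior: a still image's temporal origin drifts away from start.
    print(temporal_positions(start, 1, time_interval_for(IMAGE, tps, spg, fixed=False)))
    # New behavior: the temporal origin matches height/width at start.
    print(temporal_positions(start, 1, time_interval_for(IMAGE, tps, spg)))
    # Video keeps its temporal scaling.
    print(temporal_positions(start, 3, time_interval_for(VIDEO, tps, spg)))
```

With these example values, the unfixed image path yields a temporal origin of 10 while height and width start at 5; the fixed path yields 5, so all three axes share the same origin.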