Fix Qwen2.5-VL temporal RoPE scaling applied to still images (#45330)
get_rope_index unconditionally applies tokens_per_second temporal scaling to
both images and videos. For still images (modality_type == 1), this shifts the
temporal position origin to start_position * tokens_per_second instead of
start_position, so it no longer matches the height and width dimensions, which
start at start_position.
Only apply temporal scaling (tokens_per_second * second_per_grid_ts) for video
inputs (modality_type == 2). Still images use time_interval=1, keeping the
temporal origin aligned with height and width at start_position.
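As a rough illustration of the rule above, here is a minimal standalone sketch.
It is not the code in get_rope_index: the helper names pick_time_interval and
temporal_origin and the IMAGE/VIDEO constants are hypothetical, and the origin
formula is a simplified model taken from the description in this message.

```python
# Hypothetical constants mirroring the modality_type values named above.
IMAGE = 1
VIDEO = 2


def pick_time_interval(modality_type, tokens_per_second, second_per_grid_t):
    """Return the temporal scaling factor for one vision input.

    Videos are scaled by tokens_per_second * second_per_grid_t.
    Still images are not scaled, so the factor is 1.
    """
    if modality_type == VIDEO:
        return tokens_per_second * second_per_grid_t
    return 1


def temporal_origin(start_position, time_interval):
    """Simplified model (an assumption): the temporal origin scales with the interval."""
    return start_position * time_interval


# Before the fix, a still image at start_position=10 with tokens_per_second=2
# would be scaled like a video; after the fix its interval is 1.
image_interval = pick_time_interval(IMAGE, tokens_per_second=2, second_per_grid_t=1.0)
video_interval = pick_time_interval(VIDEO, tokens_per_second=2, second_per_grid_t=1.0)
print(temporal_origin(10, image_interval))
print(temporal_origin(10, video_interval))
```

With time_interval=1 the image's temporal origin equals start_position, which is
the same origin used for height and width.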
Qwen3-VL inherits this fix via super().get_rope_index().
Fixes #45325
Co-authored-by: Raushan Turganbay <raushan@huggingface.co>