Extract dynamic vision/audio tensors into standalone pure functions (#45396)
* Extract pure vision/audio functions into standalone utilities
- Create top-level `modeling_vision_utils.py` with shared pure functions:
`get_vision_cu_seqlens`, `get_rotary_pos_ids`, `get_rotary_pos_ids_interleaved`,
`get_window_index`, `get_pos_embed_indices`
- Move audio precompute functions (`chunk_and_pad_features`, `get_audio_cu_seqlens`,
`get_valid_indices`, `get_pool_indices`) into modular files directly
- Simplify `VisionRotaryEmbedding.forward` to accept a precomputed `pos_ids`
  tensor and apply it via a broadcast multiply, eliminating data-dependent
  frequency-table creation inside the forward pass
- Make vision/audio encoder forwards accept optional precomputed tensors
(`cu_seqlens`, `rotary_pos_ids`, `window_index`, `embed_indices`, etc.)
- Use explicit naming: `get_vision_cu_seqlens` / `get_audio_cu_seqlens`
Models: qwen2_vl, qwen2_5_vl, qwen3_vl, qwen3_5, qwen3_vl_moe, qwen3_5_moe,
qwen2_5_omni, qwen3_omni_moe, glm4v, glm4v_moe, glm_image, glm_ocr,
ernie4_5_vl_moe, video_llama_3, mlcd, paddleocr_vl
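As a rough illustration of the refactor described above, here is a minimal sketch (assumed shapes and signatures, not the actual transformers implementation) of a pure `get_vision_cu_seqlens` precompute and a `VisionRotaryEmbedding.forward` that consumes precomputed `pos_ids` via a broadcast multiply:

```python
# Hedged sketch only: illustrates the "pure precompute + broadcast multiply"
# pattern from this PR. Real signatures in transformers may differ.
import torch


def get_vision_cu_seqlens(grid_thw: torch.Tensor) -> torch.Tensor:
    """Cumulative sequence lengths for varlen attention.

    grid_thw: (num_images, 3) tensor of (t, h, w) patch-grid sizes.
    Returns one boundary per temporal frame, prefixed with 0.
    """
    # h * w patches per frame, repeated once per temporal frame
    seqlens = (grid_thw[:, 1] * grid_thw[:, 2]).repeat_interleave(grid_thw[:, 0])
    # prepend 0 so the result can index flash-attn style varlen kernels
    return torch.nn.functional.pad(seqlens.cumsum(0), (1, 0)).to(torch.int32)


class VisionRotaryEmbedding(torch.nn.Module):
    def __init__(self, dim: int, theta: float = 10000.0):
        super().__init__()
        inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

    def forward(self, pos_ids: torch.Tensor) -> torch.Tensor:
        # Broadcast multiply: (seq_len, 1) * (dim/2,) -> (seq_len, dim/2).
        # No max-position frequency table is materialized and indexed with
        # data-dependent positions, which keeps the op compile-friendly.
        return pos_ids[:, None].float() * self.inv_freq[None, :]
```

With this split, the encoder forward can receive `cu_seqlens` and `pos_ids` as optional precomputed inputs and fall back to computing them on the fly, which is the opt-in behavior the later commits wire up.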
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Fix stale `compute_*` docstring references to match actual function names
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Revert mlcd changes — not part of this PR
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix
* kwargs
* opt-in
* fix dtype
* style
* guard torch import
* standardize
* propagate inputs
* fix docs
* fix docs
* auto docs
* more docs fixing
* fix omni
* fix paddle
* revert paddle ocr until another time
* finally fixed paddle ocr
* fix review
* revert chunking
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix torch compilable check
* fix docs
* correct func name
* fix omni
* fix video llama 3
* fix video llama 3
* requires torch
* add missing grid device
* keep rot emb in fp32
* fix test device
* fix glm4v flex attention test
* rename to vision utils
* only one get_rotary_pos_ids is needed
* style
* style
* deprecate only
* fix
* simplify and revert processor changes
* renames
* move some stuff to their original place
* style
* style
* use chunked attention
* use decorator
* pass kwargs and return_dict
* fix missing
* keep in and get from kwargs
* revert some trailing commas
* fix
* fixes
* video llama fixes
* fix qwen3 vl
* forgot glm ocr
* address comments
* drop unnecessary
* use correct flash attn check
* missed deprecation
* empty commit 1
* empty commit 2
* revert video llama 3 config changes
* style
* style fix
* address comments
* remove unnecessary
* revert TransformersKwargs and add a todo
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>