Extract dynamic vision/audio tensors into standalone pure functions (#45396)
* Extract pure vision/audio functions into standalone utilities
- Create top-level `modeling_vision_utils.py` with shared pure functions:
`get_vision_cu_seqlens`, `get_rotary_pos_ids`, `get_rotary_pos_ids_interleaved`,
`get_window_index`, `get_pos_embed_indices`
- Move audio precompute functions (`chunk_and_pad_features`, `get_audio_cu_seqlens`,
`get_valid_indices`, `get_pool_indices`) into modular files directly
- Simplify `VisionRotaryEmbedding.forward` to accept a precomputed `pos_ids`
  tensor and apply it via a broadcast multiply, eliminating data-dependent
  frequency-table creation inside the forward pass
- Make vision/audio encoder forwards accept optional precomputed tensors
(`cu_seqlens`, `rotary_pos_ids`, `window_index`, `embed_indices`, etc.)
- Use explicit naming: `get_vision_cu_seqlens` / `get_audio_cu_seqlens`
Models: qwen2_vl, qwen2_5_vl, qwen3_vl, qwen3_5, qwen3_vl_moe, qwen3_5_moe,
qwen2_5_omni, qwen3_omni_moe, glm4v, glm4v_moe, glm_image, glm_ocr,
ernie4_5_vl_moe, video_llama_3, mlcd, paddleocr_vl
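As a rough illustration of the refactor described above, here is a minimal sketch (assumed shapes and signatures, not the actual transformers implementation) of a pure `get_vision_cu_seqlens` precompute and a `VisionRotaryEmbedding.forward` that consumes precomputed `pos_ids` via a broadcast multiply:

```python
# Hedged sketch only: illustrates the "pure precompute + broadcast multiply"
# pattern from this PR. Real signatures in transformers may differ.
import torch


def get_vision_cu_seqlens(grid_thw: torch.Tensor) -> torch.Tensor:
    """Cumulative sequence lengths for varlen attention.

    grid_thw: (num_images, 3) tensor of (t, h, w) patch-grid sizes.
    Returns one boundary per temporal frame, prefixed with 0.
    """
    # h * w patches per frame, repeated once per temporal frame
    seqlens = (grid_thw[:, 1] * grid_thw[:, 2]).repeat_interleave(grid_thw[:, 0])
    # prepend 0 so the result can index flash-attn style varlen kernels
    return torch.nn.functional.pad(seqlens.cumsum(0), (1, 0)).to(torch.int32)


class VisionRotaryEmbedding(torch.nn.Module):
    def __init__(self, dim: int, theta: float = 10000.0):
        super().__init__()
        inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

    def forward(self, pos_ids: torch.Tensor) -> torch.Tensor:
        # Broadcast multiply: (seq_len, 1) * (dim/2,) -> (seq_len, dim/2).
        # No max-position frequency table is materialized and indexed with
        # data-dependent positions, which keeps the op compile-friendly.
        return pos_ids[:, None].float() * self.inv_freq[None, :]
```

With this split, the encoder forward can receive `cu_seqlens` and `pos_ids` as optional precomputed inputs and fall back to computing them on the fly, which is the opt-in behavior the later commits wire up.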
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Fix stale `compute_*` docstring references to match actual function names
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Revert mlcd changes — not part of this PR
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix
* kwargs
* opt-in
* fix dtype
* style
* guard torch import
* standardize
* propagate inputs
* fix docs
* fix docs
* auto docs
* more docs fixing
* fix omni
* fix paddle
* revert paddle ocr until another time
* finally fixed paddle ocr
* fix review
* revert chunking
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix torch compilable check
* fix docs
* correct func name
* fix omni
* fix video llama 3
* fix video llama 3
* requires torch
* add missing grid device
* keep rot emb in fp32
* fix test device
* fix glm4v flex attention test
* rename to vision utils
* only one get_rotary_pos_ids is needed
* style
* style
* deprecate only
* fix
* simplify and revert processor changes
* renames
* move some stuff to their original place
* style
* style
* use chunked attention
* use decorator
* pass kwargs and return_dict
* fix missing
* keep in and get from kwargs
* revert some trailing commas
* fix
* fixes
* video llama fixes
* fix qwen3 vl
* forgot glm ocr
* address comments
* drop unnecessary
* use correct flash attn check
* missed deprecation
* empty commit 1
* empty commit 2
* revert video llama 3 config changes
* style
* style fix
* address comments
* remove unnecessary
* revert TransformersKwargs and add a todo
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>