[`Ernie 4.5`] Ernie VL models (#39585)
* more attention cleanup
* llama-like text attention
* generates different text, but cos and sin tensors are always close (~1e-8)
* another round of rope fixups
* yeah, gonna check tomorrow; can't cheat with freqs for whatever reason
* NOTE: last commit where we compare with the old rope
* rope cleanup
* more rope
* somewhat clean 3d rope with attn; sin/cos has very small diffs vs. the original formula (torch.allclose always True), leading to slightly different generations
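A minimal sketch of the closeness check meant above; the tensors are stand-ins, not the model's actual cos/sin caches:

```python
import torch

# Stand-ins for the cos caches from the two RoPE paths; the real tensors
# come from the model, these just mimic a gap below 1e-8.
cos_ref = torch.randn(1, 128, 64)
cos_new = cos_ref + 1e-8 * torch.rand_like(cos_ref)

# The default tolerances (rtol=1e-5, atol=1e-8) absorb a gap this small, so
# allclose stays True, even though such tiny errors can compound through
# many layers and eventually flip a greedy decoding step.
print(torch.allclose(cos_ref, cos_new))  # True
```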
* new rope type
* style
* attempt at moe, gonna need a deeper look
* cleanup gate
* more cleaning
* NOTE remove attempt at moe for now
* another round of cleanups
* whoops
* we're back; reattempting the moe work
* moe should be done with this
* cleanup
* more cleanup
* nits
* add conversion and adjust code accordingly
* fix
* make moe copyable as far as we can
* cleanup conversion a bit, next config
* cleanup config part1
* small removal of unused things
* config conversion; rope type doesn't get loaded though...
* fix rope
* last hardcoded values
* remove unnecessary class
* starting to make copies available for vision, vision rope refactor tomorrow
* vl rope changes
* simplify the variable-resolution resampler
* nit
* conversion update
* more conversions, standardization, and big dtype fix!
* remove some docs (temporarily) to focus on the code for now
* oops
* nit
* fixup embeddings, add todos
* more cleanup
* more cleanup, next caching changes
* revert fp16; as discussed internally, the weights are supposed to be bf16
* fix rope (a bit), prepare cache logic changes
* more prep for cache
* cache class is used, fixup some flags
* modular refactor
* partially docstrings, docs, etc
* cleaner order
* nit
* fix config
* remove old artefacts/todos
* sync with remote and add some todos for orientation
* remove img process dep on modeling code
* image processor with a few diffs highlighted, possibly to copy from
* fast img processor version
* modular image processors
* convert tokenizer to have dedicated video placeholder token
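A hedged sketch of what the dedicated placeholder amounts to; the checkpoint path and token string are illustrative, not necessarily the ones used here:

```python
from transformers import AutoTokenizer

# Hypothetical checkpoint path, for illustration only.
tok = AutoTokenizer.from_pretrained("path/to/ernie4_5_vl_checkpoint")

# Registering the placeholder as an additional special token keeps the BPE
# model from ever splitting it, so the processor can later swap it 1:1 for
# video features.
tok.add_special_tokens({"additional_special_tokens": ["<|VIDEO_PLACEHOLDER|>"]})
video_token_id = tok.convert_tokens_to_ids("<|VIDEO_PLACEHOLDER|>")
```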
* before I forget
* a modular bug :/
* more processor things, some modular adjustments
* remove dependency on token type ids
* position ids à la qwen vl; modular is bugging
* fixup some inheritances + nits
* token type ids
* moe loss, docs, simplify pos ids
* align some feature getters
* docs
* rename conv -> merge aka our naming convention
* style
* fixup tokenizer class in auto
* no more nn.Sequential
* fix chat template, fix tokenizer conversion, modular bug
* remove this
* remove old deps (from the remote processor)
* whoops
* argh
* todo, restarting progress tomorrow
* fast image processor changes output, keeping slow for now
* NOTE rm debugging code on processor conversion
* first complete conversion script version, todo on whether to use fast processor
* config docs
* image processor tests; kept to images only, as videos need different resolutions
* processor tests
* first-ish version of the video processor, very much WIP though
* sync with main and all the changes that happened, fix ernie moe bug in dtype casting
* mini style fix
* vid processor is properly separated now
* make vid processor its own thing
* style
* video processing and cleanups, img processing done, processing needs one TODO, vid processing needs tests
* readd vid patch fn
* make 4D RoPE possible if manually passed
* simplify the message on packing; allow external prep but not the internal one
* nit
* revert general changes video utils, make it specific to ernie, fixup tests
* vid to auto
* left to check: pos ids (rope) + token type ids
* move token type ids to the processor, fix the processor to match the ernie logic
TODOs: tests, tests, tests
* processor fixes, conversion todo for fast img processor
TODOs: tests for vid processor and modeling
* fix
* video processor tests; torch.compile does not work due to PIL drawing being needed
* fix config consistency
* style
* wip tests
* fix most tests, 2 failing ones remain
* fix last tests
* check
* docs consistency
* fix conversion script, more docs
* optional drawing on frames, style
* add an error on the compile x draw-on-frames combination
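Roughly the guard being added; the function and parameter names are made up for this sketch:

```python
import torch

def draw_on_frames_guard(draw_on_frames: bool) -> None:
    # PIL drawing runs as plain Python on the CPU and cannot be traced, so
    # the combination with torch.compile is rejected up front instead of
    # failing somewhere inside the compiled graph.
    if draw_on_frames and torch.compiler.is_compiling():
        raise ValueError(
            "Drawing on frames relies on PIL and is incompatible with torch.compile."
        )
```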
* fix
* fix
* change font loading to a hub dependency with a default font
* fix config try 2
* fix diff resolution, tests (not fast processor, a100)
* fix test
* style
* torch 2.9 (fa2 untested, video from 2.6)
* raushan's review (part 1)
* Update docs/source/en/model_doc/ernie4_5_vl.md
Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com>
* Pablo's review
* style
* drop the device/dtype handling that is no longer needed
* revert the vision property removal; it's necessary for the composite sdpa test
* fix up a few smaller things + refactor how we load the font entirely (based on the font name, with the expected file at the same repo)
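The loading pattern described here, sketched; the repo id and font name are placeholders, not the actual ones used:

```python
from huggingface_hub import hf_hub_download
from PIL import ImageFont

# The font is resolved from a Hub repo by name, expecting "<name>.ttf" to
# live in that same repo, instead of shipping the file with the package.
font_name = "NotoSansCJK"  # placeholder default
font_path = hf_hub_download(repo_id="some-org/fonts", filename=f"{font_name}.ttf")
font = ImageFont.truetype(font_path, size=20)
```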
* remove the BC min/max pixels --> less modular on the processor parts but way cleaner code
* fix fps and add fixme to the inefficient conversion stuff
* rope
* style
* copies and the last rope stuff I forgot
* revert glm4v copies
* fix
* simplify temporal slicing and add more descriptions
* that ":" :cry:
* fixup init
* conversion for the moe split and merge + general renamings etc.; encountering OOM (automap maybe?)
* wrong order whoops
* style
* copies
* fix init
* fix
* fix
* allow the resolved path to be passed to explicit video processor classes and refactor how we load them for ernie
* simplify
* shoot, I need it there as well
* better err handling
* style
* initial fixes after merge
* working loading version
* cleanup
* change moe order and fix vl version
* reverse op is mapping incorrectly, TODO
* reverse loading somewhat works, name conversion has issues it seems :eyes:
* fix renaming issue, slow tests pass (except the integration ones ~ expected due to fused weights)
* conversion mapping with native features + remove conversion mapping restriction
* add test for new conversion
* style
* update conversion
* fix integration tests, remove fa tests
* fix
* update docs a bit
* style
* fix ernie moe and routing ernie series
* style
* fix rope warning
* I messed up again, pain
* update expectations
* remove EP; broken atm, whether alone or in combination with TP
* update docs a bit
* first part of addressing review comments
* fixup
* fix vid processor
* fix font saving
* readd decorators oops
* add mm token type id shortcut
* always compose mm token type ids if needed
* move config to modular
* fix loading by enforcing correct order
* fix
* address first bunch of comments
* smaller comments
* let's make moe layer types; I'll fix modular in a second
* modular
* style
* renamed version along a few fixes in conversion and processor tests
* fix
* style + decorator
* fix tokenizer handling of additional special tokens
* style
* fix doc refs
* test fix
* fix
* was this too breaking?
* fix conversion via workaround for now
* post merge fix
* revert a few tok things (additional_special_tokens), updated conversion
* fix video processing loading logic
add an exception for the auto class (reload the config, since we have a circular dependency: we need to load to find the class, then load with the class-specific logic; see the sketch below)
remove some original ideas
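The two-step resolution from the commit above, as a sketch; the registry and function are hypothetical stand-ins for the library internals:

```python
from transformers import AutoConfig

# Hypothetical registry mapping model_type -> concrete video processor class;
# the real dispatch lives inside the library.
VIDEO_PROCESSOR_REGISTRY: dict = {}

def load_video_processor(model_id: str):
    # Load the config first purely to learn which concrete class applies:
    # the class choice depends on what we load, hence the circular dependency
    # and the extra reload noted above.
    config = AutoConfig.from_pretrained(model_id)
    processor_cls = VIDEO_PROCESSOR_REGISTRY[config.model_type]
    # Then reload through the concrete class so its specific logic runs.
    return processor_cls.from_pretrained(model_id)
```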
* style
* processor path change
* add small dummy integration tests
* style
* fix rope modeling to follow qwen2 vl instead + change auto loading to specifically load via pretrained (overridable from pretrained for auto classes)
* seems to be skipped in other similar vlms
* small conversion updates and adjust max vram usage during the big integration test
* update test paths
* style
* style attempt 2
* docs
* trigger ci
* review
* post merge fixes
* fix
* safety
* fix test
* style
* oops
* fix
* ...
* simplify the config init for moe pattern
* gonna be fixed by #42963
---------
Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com>