[`Ernie 4.5`] Ernie VL models (#39585)
* more attention cleanup
* llama-like text attention
* generates different text, but cos and sin tensors are always close (~1e-8)
* another round of rope fixups
* yeah, gonna check tomorrow; can't cheat with freqs for whatever reason
* NOTE: last commit where we compare with the old rope
* rope cleanup
* more rope
* somewhat clean 3d rope with attn; sin/cos has very small diffs vs. the original formula (torch.allclose always True), leading to slightly different generations
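A minimal sketch of the closeness check meant above; the tensors are stand-ins, not the model's actual cos/sin caches:

```python
import torch

# Stand-ins for the cos caches from the two RoPE paths; the real tensors
# come from the model, these just mimic a gap below 1e-8.
cos_ref = torch.randn(1, 128, 64)
cos_new = cos_ref + 1e-8 * torch.rand_like(cos_ref)

# The default tolerances (rtol=1e-5, atol=1e-8) absorb a gap this small, so
# allclose stays True, even though such tiny errors can compound through
# many layers and eventually flip a greedy decoding step.
print(torch.allclose(cos_ref, cos_new))  # True
```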
* new rope type
* style
* attempt at moe, gonna need a deeper look
* cleanup gate
* more cleaning
* NOTE remove attempt at moe for now
* another round of cleanups
* whoops
* we're back; reattempting the moe work
* moe should be done with this
* cleanup
* more cleanup
* nits
* add conversion and adjust code accordingly
* fix
* make moe copyable as far as we can
* cleanup conversion a bit, next config
* cleanup config part1
* small removal of unused things
* config conversion; rope type doesn't get loaded though...
* fix rope
* last hardcoded values
* remove unnecessary class
* starting to make copies available for vision, vision rope refactor tomorrow
* vl rope changes
* simplify the variable-resolution resampler
* nit
* conversion update
* more conversions, standardization, and big dtype fix!
* remove some docs (temporarily) to focus on the code for now
* oops
* nit
* fixup embeddings, add todos
* more cleanup
* more cleanup, next caching changes
* revert fp16; as discussed internally, the weights are supposed to be bf16
* fix rope (a bit), prepare cache logic changes
* more prep for cache
* cache class is used, fixup some flags
* modular refactor
* partially docstrings, docs, etc
* cleaner order
* nit
* fix config
* remove old artefacts/todos
* sync with remote and add some todos for orientation
* remove img process dep on modeling code
* image processor with a few diffs highlighted, possibly to copy from
* fast img processor version
* modular image processors
* convert tokenizer to have dedicated video placeholder token
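A hedged sketch of what the dedicated placeholder amounts to; the checkpoint path and token string are illustrative, not necessarily the ones used here:

```python
from transformers import AutoTokenizer

# Hypothetical checkpoint path, for illustration only.
tok = AutoTokenizer.from_pretrained("path/to/ernie4_5_vl_checkpoint")

# Registering the placeholder as an additional special token keeps the BPE
# model from ever splitting it, so the processor can later swap it 1:1 for
# video features.
tok.add_special_tokens({"additional_special_tokens": ["<|VIDEO_PLACEHOLDER|>"]})
video_token_id = tok.convert_tokens_to_ids("<|VIDEO_PLACEHOLDER|>")
```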
* before I forget
* a modular bug :/
* more processor things, some modular adjustments
* remove dependency on token type ids
* position ids à la qwen vl; modular is bugging
* fixup some inheritances + nits
* token type ids
* moe loss, docs, simplify pos ids
* align some feature getters
* docs
* rename conv -> merge aka our naming convention
* style
* fixup tokenizer class in auto
* no more nn.Sequential
* fix chat template, fix tokenizer conversion, modular bug
* remove this
* remove old deps (from the remote processor)
* whoops
* argh
* todo, restarting progress tomorrow
* fast image processor changes output, keeping slow for now
* NOTE rm debugging code on processor conversion
* first complete conversion script version, todo on whether to use fast processor
* config docs
* image processor tests; kept to images only, as videos need different resolutions
* processor tests
* first-ish version of the video processor, very much WIP though
* sync with main and all the changes that happened, fix ernie moe bug in dtype casting
* mini style fix
* vid processor is properly separated now
* make vid processor its own thing
* style
* video processing and cleanups, img processing done, processing needs one TODO, vid processing needs tests
* readd vid patch fn
* make 4D RoPE possible if manually passed
* simplify the message on packing; allow external prep but not the internal one
* nit
* revert general changes video utils, make it specific to ernie, fixup tests
* vid to auto
* left to check: pos ids (rope) + token type ids
* move token type ids to the processor, fix the processor to match the ernie logic
TODOs: tests, tests, tests
* processor fixes, conversion todo for fast img processor
TODOs: tests for vid processor and modeling
* fix
* video processor tests; torch.compile does not work due to PIL drawing being needed
* fix config consistency
* style
* wip tests
* fix most tests, 2 failing ones remain
* fix last tests
* check
* docs consistency
* fix conversion script, more docs
* optional drawing on frames, style
* add an error on the compile x draw-on-frames combination
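Roughly the guard being added; the function and parameter names are made up for this sketch:

```python
import torch

def draw_on_frames_guard(draw_on_frames: bool) -> None:
    # PIL drawing runs as plain Python on the CPU and cannot be traced, so
    # the combination with torch.compile is rejected up front instead of
    # failing somewhere inside the compiled graph.
    if draw_on_frames and torch.compiler.is_compiling():
        raise ValueError(
            "Drawing on frames relies on PIL and is incompatible with torch.compile."
        )
```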
* fix
* fix
* change font loading to a hub dependency with a default font
* fix config try 2
* fix diff resolution, tests (not fast processor, a100)
* fix test
* style
* torch 2.9 (fa2 untested, video from 2.6)
* raushan's review (part 1)
* Update docs/source/en/model_doc/ernie4_5_vl.md
Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com>
* Pablo's review
* style
* drop the device/dtype handling that is no longer needed
* revert the vision property removal; it's necessary for the composite sdpa test
* fix up a few smaller things + refactor how we load the font entirely (based on the font name, with the expected file at the same repo)
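The loading pattern described here, sketched; the repo id and font name are placeholders, not the actual ones used:

```python
from huggingface_hub import hf_hub_download
from PIL import ImageFont

# The font is resolved from a Hub repo by name, expecting "<name>.ttf" to
# live in that same repo, instead of shipping the file with the package.
font_name = "NotoSansCJK"  # placeholder default
font_path = hf_hub_download(repo_id="some-org/fonts", filename=f"{font_name}.ttf")
font = ImageFont.truetype(font_path, size=20)
```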
* remove the BC min/max pixels --> less modular on the processor parts but way cleaner code
* fix fps and add fixme to the inefficient conversion stuff
* rope
* style
* copies and the last rope stuff I forgot
* revert glm4v copies
* fix
* simplify temporal slicing and add more descriptions
* that ":" :cry:
* fixup init
* conversion for the moe split and merge + general renamings etc.; encountering OOM (automap maybe?)
* wrong order whoops
* style
* copies
* fix init
* fix
* fix
* allow the resolved path to be passed to explicit video processor classes and refactor how we load them for ernie
* simplify
* shoot, I need it there as well
* better err handling
* style
* initial fixes after merge
* working loading version
* cleanup
* change moe order and fix vl version
* reverse op is mapping incorrectly, TODO
* reverse loading somewhat works, name conversion has issues it seems :eyes:
* fix renaming issue, slow tests pass (except the integration ones ~ expected due to fused weights)
* conversion mapping with native features + remove conversion mapping restriction
* add test for new conversion
* style
* update conversion
* fix integration tests, remove fa tests
* fix
* update docs a bit
* style
* fix ernie moe and routing ernie series
* style
* fix rope warning
* I messed up again, pain
* update expectations
* remove EP; broken atm, whether alone or in combination with TP
* update docs a bit
* first part of addressing review comments
* fixup
* fix vid processor
* fix font saving
* readd decorators oops
* add mm token type id shortcut
* always compose mm token type ids if needed
* move config to modular
* fix loading by enforcing correct order
* fix
* address first bunch of comments
* smaller comments
* let's make moe layer types; I'll fix modular in a second
* modular
* style
* renamed version along a few fixes in conversion and processor tests
* fix
* style + decorator
* fix tokenizer handling of additional special tokens
* style
* fix doc refs
* test fix
* fix
* was this too breaking?
* fix conversion via workaround for now
* post merge fix
* revert a few tok things (additional_special_tokens), updated conversion
* fix video processing loading logic
add an exception for the auto class (reload the config, since we have a circular dependency: we need to load to find the class, then load with the class-specific logic; see the sketch below)
remove some original ideas
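The two-step resolution from the commit above, as a sketch; the registry and function are hypothetical stand-ins for the library internals:

```python
from transformers import AutoConfig

# Hypothetical registry mapping model_type -> concrete video processor class;
# the real dispatch lives inside the library.
VIDEO_PROCESSOR_REGISTRY: dict = {}

def load_video_processor(model_id: str):
    # Load the config first purely to learn which concrete class applies:
    # the class choice depends on what we load, hence the circular dependency
    # and the extra reload noted above.
    config = AutoConfig.from_pretrained(model_id)
    processor_cls = VIDEO_PROCESSOR_REGISTRY[config.model_type]
    # Then reload through the concrete class so its specific logic runs.
    return processor_cls.from_pretrained(model_id)
```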
* style
* processor path change
* add small dummy integration tests
* style
* fix rope modeling to follow qwen2 vl instead + change auto loading to specifically load via pretrained (overridable from pretrained for auto classes)
* seems to be skipped in other similar vlms
* small conversion updates and adjust max vram usage during the big integration test
* update test paths
* style
* style attempt 2
* docs
* trigger ci
* review
* post merge fixes
* fix
* safety
* fix test
* style
* oops
* fix
* ...
* simplify the config init for moe pattern
* gonna be fixed by #42963
---------
Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com>