transformers
8964b4b8 - Add MiniCPM3 (#41116)

Commit
6 days ago
Add MiniCPM3 (#41116)

* Add MiniCPM3

Adds support for the OpenBMB MiniCPM3 architecture (e.g. `openbmb/MiniCPM3-4B`). MiniCPM3 combines Multi-head Latent Attention (MLA) from DeepSeek-V2 with a standard SwiGLU MLP and three scalar scaling factors that govern signal flow:

- `scale_emb` scales input embeddings.
- `scale_depth / sqrt(num_hidden_layers)` scales residual connections.
- `hidden_size / dim_model_base` scales hidden states before the language-model head.

The implementation follows the modular pattern, deriving the config and model from `LlamaConfig` / `LlamaModel`. The MLA attention keeps the cos/sin rotary convention used by the original implementation, registered classes are wired into the auto mappings, and `tokenizer_auto` is updated so the upstream tokenizer falls back to the tokenizers backend.

Verified with `openbmb/MiniCPM3-4B` end-to-end: weights load without missing/mismatched keys and greedy generation produces a coherent reply via `attn_implementation="eager"`.

Co-authored-by: Aladdin Aliyev <213189260+aliyevaladddin@users.noreply.github.com>

* Default MiniCPM3 scalings to no-ops when unspecified

`scale_depth=1.0` and `dim_model_base=1` make the residual and logit scalings collapse, which prevented the small tester model from training and broke the `test_training_overfit` CI job.

Mirror the original MiniCPM3 defaults: when not provided, `scale_depth` falls back to `sqrt(num_hidden_layers)` and `dim_model_base` falls back to `hidden_size`, so both factors become exact no-ops. Drop the explicit overrides in `MiniCPM3ModelTester` so the small test config inherits the new defaults.

The real `openbmb/MiniCPM3-4B` config is unaffected: it sets all three scalars explicitly.

Co-authored-by: Aladdin Aliyev <213189260+aliyevaladddin@users.noreply.github.com>

* Address review: inherit dsv2 attention, slim config, precompute scalings

- MiniCPM3Attention now inherits DeepseekV2Attention and overrides only forward (cos/sin RoPE instead of DeepSeek-V2's complex rotary), removing the duplicated MLA __init__. Forward output is numerically identical.
- Drop config fields that match LlamaConfig defaults; keep only the ones that differ plus the MLA/scaling extras.
- Remove the validate_architecture override so the standard Llama divisibility check applies, matching dsv2/dsv3.
- Precompute the residual depth scaling per layer; expose the logit scaling as a config property; add comments highlighting the diffs from Llama (embedding, residual, logit scalings).
- Bump copyright years to 2026; use AutoModelForCausalLM in the docs example; drop the MoE-oriented test_all_params_have_gradient flag.
- Rework integration tests to the value-based Expectations pattern (expected logits/text to be filled from a CI reference run).

* docs: sync MiniCPM3 model card date line with current wording

Update the date line to the current "published in HF papers on ... and contributed to ..." format expected by utils/add_dates.py, fixing the check_repository_consistency failure.

* Address review: align config defaults, scaled embedding, fill integration values

- Default config to the openbmb/MiniCPM3-4B checkpoint values (scale_emb=12, scale_depth=1.4, dim_model_base=256); no-op scaling only when set to None.
- Drop redundant keys_to_ignore_at_inference (inherited from LlamaConfig).
- Move embedding scaling into MiniCPM3ScaledWordEmbedding so input_ids and inputs_embeds paths stay consistent (fixes inputs_embeds-vs-input_ids tests).
- Remove redundant self_attn assignment in the decoder layer (modular renames it).
- Highlight the MLA RoPE diff vs DeepSeek-V2/V3 with a comment.
- Fill integration test Expectations with verified A100 (bf16) reference values.

* Skip test_sdpa_can_dispatch_on_flash for MiniCPM3 (MLA head dims)

MiniCPM3 uses MLA (inherited from DeepSeek-V2), so the query/key head dim (qk_nope + qk_rope) differs from the value head dim. PyTorch's flash kernel requires q, k, v to share the same last dim, so SDPA cannot dispatch on flash. Skip the test as DeepSeek-V3 already does for the same reason.

* fixes for ci + simplify modular a bit

* Fix repo consistency: allowlist dim_model_base and refresh minicpm3 dates

---------

Co-authored-by: Aladdin Aliyev <213189260+aliyevaladddin@users.noreply.github.com>
Co-authored-by: vasqu <antonprogamer@gmail.com>
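
Note on the scaling factors: the three scalars and their `None` fallbacks described above can be sketched as plain arithmetic. This is a minimal illustration, not the code added by this commit. The helper name `minicpm3_scalings` is hypothetical, the `scale_emb=None -> 1.0` fallback is an assumption (the message only spells out the fallbacks for `scale_depth` and `dim_model_base`), and the sizes in the usage lines are toy values, not the 4B checkpoint's.

```python
import math


def minicpm3_scalings(hidden_size, num_hidden_layers,
                      scale_emb=12, scale_depth=1.4, dim_model_base=256):
    """Hypothetical helper: derive the three factors named in the commit message.

    Defaults mirror the checkpoint values quoted in the message
    (scale_emb=12, scale_depth=1.4, dim_model_base=256).
    Passing None selects the no-op fallback for that scalar.
    """
    if scale_emb is None:
        scale_emb = 1.0  # assumption: None means "do not scale embeddings"
    if scale_depth is None:
        scale_depth = math.sqrt(num_hidden_layers)  # residual factor becomes 1
    if dim_model_base is None:
        dim_model_base = hidden_size  # logit factor becomes 1

    return {
        "embedding": scale_emb,
        "residual": scale_depth / math.sqrt(num_hidden_layers),
        "logit": hidden_size / dim_model_base,
    }


# Toy sizes chosen only for illustration.
print(minicpm3_scalings(hidden_size=512, num_hidden_layers=16))
print(minicpm3_scalings(64, 4, scale_emb=None, scale_depth=None, dim_model_base=None))
```

With every scalar set to `None` all three factors come out as exactly 1, which is the property the small tester config relies on.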
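
Note on the skipped flash test: the head-dim mismatch can be shown with a small shape check. This is a sketch under stated assumptions, not code from the PR. The function names are hypothetical and the dimensions in the usage lines are illustrative values, not read from the `openbmb/MiniCPM3-4B` config. It only encodes the rule stated in the message, that q, k and v must share the same last dim.

```python
def mla_head_dims(qk_nope_head_dim, qk_rope_head_dim, v_head_dim):
    """Return (query/key head dim, value head dim) for an MLA layer.

    In MLA the query/key head is the concatenation of a non-rotary part
    (qk_nope) and a rotary part (qk_rope); the value head is sized separately.
    """
    return qk_nope_head_dim + qk_rope_head_dim, v_head_dim


def shares_last_dim(qk_nope_head_dim, qk_rope_head_dim, v_head_dim):
    """Hypothetical check for the constraint named in the commit message:
    q, k and v must have the same last dimension."""
    qk_head_dim, v_dim = mla_head_dims(qk_nope_head_dim, qk_rope_head_dim, v_head_dim)
    return qk_head_dim == v_dim


# Illustrative MLA-style dims: 64 + 32 = 96 for q/k versus 64 for v.
print(mla_head_dims(64, 32, 64))
print(shares_last_dim(64, 32, 64))
```

A regular multi-head attention layer would pass this check because one `head_dim` is used for q, k and v alike.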