Add MiniCPM3 (#41116)
* Add MiniCPM3
Adds support for the OpenBMB MiniCPM3 architecture (e.g.
`openbmb/MiniCPM3-4B`). MiniCPM3 combines Multi-head Latent
Attention (MLA) from DeepSeek-V2 with a standard SwiGLU MLP and
three scalar scaling factors that govern signal flow:
- `scale_emb` scales input embeddings.
- `scale_depth / sqrt(num_hidden_layers)` scales residual
connections.
- `hidden_size / dim_model_base` divides the hidden states before
the language-model head.
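As a rough illustration of the three factors above, here is a minimal
sketch in plain Python. The helper name `minicpm3_scalings` is
hypothetical and not part of the library; `scale_emb=12`,
`scale_depth=1.4` and `dim_model_base=256` are the checkpoint values
quoted further down in this message, while `hidden_size=2560` and
`num_hidden_layers=62` are assumed for the example.

```python
import math


def minicpm3_scalings(scale_emb, scale_depth, dim_model_base,
                      hidden_size, num_hidden_layers):
    """Hypothetical helper: compute the three scalar factors listed above."""
    return {
        # Multiplies the input embeddings.
        "embedding": scale_emb,
        # Multiplies each residual branch before it is added back.
        "residual": scale_depth / math.sqrt(num_hidden_layers),
        # Factor applied to the hidden states before the LM head.
        "logit": hidden_size / dim_model_base,
    }


# hidden_size and num_hidden_layers are assumed values for illustration.
factors = minicpm3_scalings(scale_emb=12, scale_depth=1.4, dim_model_base=256,
                            hidden_size=2560, num_hidden_layers=62)
```

With these inputs the residual factor is well below 1, so each layer
contributes only a damped update to the residual stream.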
The implementation follows the modular pattern, deriving the config
and model from `LlamaConfig` / `LlamaModel`. The MLA attention
keeps the cos/sin rotary convention used by the original
implementation. The registered classes are wired into the auto
mappings, and `tokenizer_auto` is updated so the upstream tokenizer
falls back to the tokenizers backend.
Verified with `openbmb/MiniCPM3-4B` end-to-end: weights load
without missing/mismatched keys and greedy generation produces a
coherent reply with `attn_implementation="eager"`.
Co-authored-by: Aladdin Aliyev <213189260+aliyevaladddin@users.noreply.github.com>
* Default MiniCPM3 scalings to no-ops when unspecified
With `scale_depth=1.0` and `dim_model_base=1`, the residual scaling
becomes `1 / sqrt(num_hidden_layers)` and the logit scaling factor
becomes `hidden_size`, which prevented the small tester model
from training and broke the `test_training_overfit` CI job.
Mirror the original MiniCPM3 defaults: when not provided,
`scale_depth` falls back to `sqrt(num_hidden_layers)` and
`dim_model_base` falls back to `hidden_size`, so both factors
become exact no-ops. Drop the explicit overrides in
`MiniCPM3ModelTester` so the small test config inherits the new
defaults.
The real `openbmb/MiniCPM3-4B` config is unaffected: it sets
all three scalars explicitly.
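The fallback described in this commit can be sketched as follows. This
is a minimal illustration, not the actual config code:
`resolve_scalings` is a hypothetical helper, and the tiny
`hidden_size=32`, `num_hidden_layers=2` values are made up to stand in
for a small test config.

```python
import math


def resolve_scalings(hidden_size, num_hidden_layers,
                     scale_depth=None, dim_model_base=None):
    """Hypothetical helper mirroring the fallback described above."""
    if scale_depth is None:
        # Cancels the 1 / sqrt(num_hidden_layers) in the residual factor.
        scale_depth = math.sqrt(num_hidden_layers)
    if dim_model_base is None:
        # Makes hidden_size / dim_model_base equal to 1.
        dim_model_base = hidden_size
    residual_factor = scale_depth / math.sqrt(num_hidden_layers)
    logit_factor = hidden_size / dim_model_base
    return residual_factor, logit_factor


# Made-up small config with both scalars left unspecified.
residual_factor, logit_factor = resolve_scalings(hidden_size=32,
                                                 num_hidden_layers=2)
```

Both returned factors are exactly 1.0 here, so neither scaling changes
the signal when the fields are left unset.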
Co-authored-by: Aladdin Aliyev <213189260+aliyevaladddin@users.noreply.github.com>
* Address review: inherit dsv2 attention, slim config, precompute scalings
- MiniCPM3Attention now inherits DeepseekV2Attention and overrides only
forward (cos/sin RoPE instead of DeepSeek-V2's complex rotary), removing
the duplicated MLA __init__. Forward output is numerically identical.
- Drop config fields that match LlamaConfig defaults; keep only the ones
that differ plus the MLA/scaling extras.
- Remove the validate_architecture override so the standard Llama
divisibility check applies, matching DeepSeek-V2/V3.
- Precompute the residual depth scaling per layer; expose the logit
scaling as a config property; add comments highlighting the diffs from
Llama (embedding, residual, logit scalings).
- Bump copyright years to 2026; use AutoModelForCausalLM in the docs
example; drop the MoE-oriented test_all_params_have_gradient flag.
- Rework integration tests to the value-based Expectations pattern
(expected logits/text to be filled from a CI reference run).
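To illustrate the first bullet above: a complex rotary and a cos/sin
rotary apply the same rotation to a pair of channels. The sketch below
checks this for a single pair in plain Python. Both function names are
hypothetical, and real implementations operate on tensors and may also
differ in how channels are paired, which is not shown here.

```python
import cmath
import math


def rotate_complex(x1, x2, theta):
    """Rotate the pair (x1, x2) by multiplying with exp(i * theta)."""
    z = complex(x1, x2) * cmath.exp(1j * theta)
    return z.real, z.imag


def rotate_cos_sin(x1, x2, theta):
    """Rotate the same pair with explicit cos/sin terms."""
    cos, sin = math.cos(theta), math.sin(theta)
    return x1 * cos - x2 * sin, x1 * sin + x2 * cos


# Arbitrary illustrative pair and angle.
pair, theta = (0.5, -1.25), 0.3
a = rotate_complex(*pair, theta)
b = rotate_cos_sin(*pair, theta)
max_abs_diff = max(abs(p - q) for p, q in zip(a, b))
```

The two results agree up to floating-point rounding.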
* docs: sync MiniCPM3 model card date line with current wording
Update the date line to the current "published in HF papers on ... and
contributed to ..." format expected by utils/add_dates.py, fixing the
check_repository_consistency failure.
* Address review: align config defaults, scaled embedding, fill integration values
- Default config to the openbmb/MiniCPM3-4B checkpoint values (scale_emb=12,
scale_depth=1.4, dim_model_base=256); no-op scaling only when set to None.
- Drop redundant keys_to_ignore_at_inference (inherited from LlamaConfig).
- Move embedding scaling into MiniCPM3ScaledWordEmbedding so input_ids and
inputs_embeds paths stay consistent (fixes inputs_embeds-vs-input_ids tests).
- Remove redundant self_attn assignment in the decoder layer (modular renames it).
- Highlight the MLA RoPE diff vs DeepSeek-V2/V3 with a comment.
- Fill integration test Expectations with verified A100 (bf16) reference values.
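A toy sketch of why moving the scaling into the embedding module keeps
the two input paths consistent. `ScaledEmbedding` and `toy_model` are
hypothetical stand-ins written in plain Python, not the actual
`MiniCPM3ScaledWordEmbedding` or model code.

```python
class ScaledEmbedding:
    """Toy stand-in for an embedding layer that applies its own scale."""

    def __init__(self, table, scale):
        self.table = table  # one row of floats per token id
        self.scale = scale

    def __call__(self, input_ids):
        return [[v * self.scale for v in self.table[i]] for i in input_ids]


def toy_model(embed, input_ids=None, inputs_embeds=None):
    """Hypothetical forward pass with no extra scaling outside the embedding."""
    if inputs_embeds is None:
        inputs_embeds = embed(input_ids)
    # Placeholder for the decoder stack: sum each embedding row.
    return [sum(row) for row in inputs_embeds]


embed = ScaledEmbedding(table=[[1.0, 2.0], [3.0, 4.0]], scale=12)
ids = [1, 0]
from_ids = toy_model(embed, input_ids=ids)
from_embeds = toy_model(embed, inputs_embeds=embed(ids))
```

Because the scale lives inside the embedding, embeddings obtained from
the layer and passed back as `inputs_embeds` are already scaled, and
both calls return the same values.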
* Skip test_sdpa_can_dispatch_on_flash for MiniCPM3 (MLA head dims)
MiniCPM3 uses MLA (inherited from DeepSeek-V2), so the query/key head dim
(qk_nope + qk_rope) differs from the value head dim. PyTorch's flash kernel
requires q, k, v to share the same last dim, so SDPA cannot dispatch on flash.
Skip the test as DeepSeek-V3 already does for the same reason.
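The head-dim mismatch can be sketched as a simple check. The function
below is hypothetical, the parameter names are assumed to follow
DeepSeek-V2-style config fields, and the numbers are illustrative
rather than read from a checkpoint.

```python
def flash_compatible_head_dims(qk_nope_head_dim, qk_rope_head_dim, v_head_dim):
    """Hypothetical check: do query/key and value share the same last dim?"""
    qk_head_dim = qk_nope_head_dim + qk_rope_head_dim
    return qk_head_dim == v_head_dim


# MLA-style layout (illustrative values): q/k carry extra rope dims.
mla_ok = flash_compatible_head_dims(qk_nope_head_dim=64,
                                    qk_rope_head_dim=32,
                                    v_head_dim=64)
# Conventional attention (illustrative values): one shared head dim.
mha_ok = flash_compatible_head_dims(qk_nope_head_dim=0,
                                    qk_rope_head_dim=128,
                                    v_head_dim=128)
```

The MLA-style case fails the check while the conventional layout
passes, which is the condition the skipped test depends on.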
* Fix CI and simplify modular a bit
* Fix repo consistency: allowlist dim_model_base and refresh minicpm3 dates
---------
Co-authored-by: Aladdin Aliyev <213189260+aliyevaladddin@users.noreply.github.com>
Co-authored-by: vasqu <antonprogamer@gmail.com>