Add modular_esmc.py; generate modeling_esmc.py from it

ESMC now follows the modular convention. modular_esmc.py is the source of
truth; modeling_esmc.py is generated by utils/modular_model_converter.py and
carries the auto-generated header.

Reuse from esm (the natural parent, also a bidirectional protein encoder):
`eager_attention_forward`, `rotate_half`, and `apply_rotary_pos_emb` are now
imported from ..esm.modeling_esm and inlined into the generated file with
`# Copied from` headers (so they stay in sync). `rotate_half` is pulled in
transitively as a dependency of `apply_rotary_pos_emb`, matching the qwen3
pattern.
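
Illustration only: a toy sketch of what "pulled in transitively" means. It
is not how utils/modular_model_converter.py is implemented; the helper
`transitive_deps` and the simplified list-based function bodies in `SOURCE`
are hypothetical stand-ins. Starting from one requested function, it walks
the call graph to find every same-module function that has to be copied
along with it.

```python
import ast

# Simplified stand-ins for the shared helpers (plain lists, not tensors).
SOURCE = '''
def rotate_half(x):
    half = len(x) // 2
    return [-v for v in x[half:]] + list(x[:half])

def apply_rotary_pos_emb(q, cos, sin):
    rotated = rotate_half(q)
    return [a * c + b * s for a, b, c, s in zip(q, rotated, cos, sin)]

def eager_attention_forward(scores):
    return scores
'''


def transitive_deps(source, roots):
    """Return the names of all module-level functions reachable from roots."""
    tree = ast.parse(source)
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    needed, stack = set(), list(roots)
    while stack:
        name = stack.pop()
        # Skip names already handled and builtins such as len() or zip().
        if name in needed or name not in funcs:
            continue
        needed.add(name)
        for node in ast.walk(funcs[name]):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                stack.append(node.func.id)
    return needed


deps = transitive_deps(SOURCE, ["apply_rotary_pos_emb"])
print(sorted(deps))  # ['apply_rotary_pos_emb', 'rotate_half']
```

Requesting only `apply_rotary_pos_emb` is enough to bring `rotate_half`
along, while `eager_attention_forward` has no local dependencies.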

Everything else stays ESMC-specific and is defined in the modular file: the
SAE-integrated ESMCModel + ForMaskedLM/SequenceClassification/
TokenClassification, the fused-LN MultiHeadAttention, SwiGLU FFN,
TransformerStack, ESMCRotaryEmbedding, and the SAE-carrying output
dataclasses. As expected for this architecture, the dedup is modest; the win
is convention compliance and auto-sync of the shared functions.

The modular file was ruff-fixed/formatted (Optional[X] -> X | None, import
order) before regeneration, so both files are now ruff-clean.

Verified: `check_modular_conversion.py` passes (files in sync); the
`transformers` package imports cleanly; and loading identical weights
reproduces the pre-conversion last_hidden_state bit-for-bit (difference of
0.0) at all valid positions for plain, padding-mask, and multi-chain inputs.
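
For reference, a minimal sketch of a "valid positions only" comparison. It
assumes nested lists shaped [batch][seq][hidden] and a 0/1 mask shaped
[batch][seq]; the function name `max_abs_diff_at_valid` and the toy values
are hypothetical, and a real check would more likely operate on torch
tensors. Padded positions are excluded because their hidden states are not
meaningful.

```python
def max_abs_diff_at_valid(ref, new, mask):
    """Largest |ref - new| over positions where mask is 1; padding is ignored."""
    diffs = [
        abs(r - n)
        for ref_row, new_row, mask_row in zip(ref, new, mask)
        for r_vec, n_vec, keep in zip(ref_row, new_row, mask_row)
        if keep
        for r, n in zip(r_vec, n_vec)
    ]
    return max(diffs, default=0.0)


# Toy data: one sequence of three positions, the last one is padding.
ref = [[[0.5, -1.0], [2.0, 3.0], [9.0, 9.0]]]
new = [[[0.5, -1.0], [2.0, 3.0], [7.0, 7.0]]]
mask = [[1, 1, 0]]

masked_diff = max_abs_diff_at_valid(ref, new, mask)
unmasked_diff = max_abs_diff_at_valid(ref, new, [[1, 1, 1]])
print(masked_diff, unmasked_diff)  # 0.0 2.0
```

The padded position differs in the toy data, yet the masked result is still
0.0, which is the sense in which outputs can match at all valid positions.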

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>