Adding Cosmos 3 to Diffusers (#13818)
* Adding Cosmos 3
* removed dead code
* Change custom TimeEmbedding Layer to Diffusers Time Embedding
* removed dependency on hugging face transformers
* refactor 1
* Fixed Attention Pattern
* Removed from_pretrained overrides
* Removing normalization from the audio Tokenizer
* fixed diffusers checkpoint
* fixed video save uint conversion
* added forward hook for cpu offload case
* removed dead params for sound tokenizer
* renaming audio encoder for readability
* ruff format
* Fix checkpoint conversion script for sound tokenizer
* Audio Decoder trim and removing some dead code
* removing dead sequence packing code
* refactor pipeline to diffusers style formatting
* removing use of cosmos3 audio encoder
* Revert "removing use of cosmos3 audio encoder"
This reverts commit 1b8b99a95e5acc749b5c100637ed382de138c606.
* refactor audio encoder
* inline remaining sequence packing functions and lint
* Removed GenerationDataClean class and Action logic
* inlined default args
* removed dead code and refactoring
* drop pipeline-helper @no_grad, inline derive helper, move guidance check to check_inputs
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* drop unused list bookkeeping from PackedSequence
attn_modes was never read; sample_lens collapses into the existing
sequence_length int (we only pack a single sample at a time); split_lens
collapses into a single und_len int (only split_lens[0] was ever read).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* extract pack-time build state from PackedSequence dataclass
curr, _use_mrope, _mrope_temporal_offset, _mrope_reset_spatial were
transient counters used while building the joint sequence, not part of
the finalized output. Thread them through _pack_*_tokens as positional
args/returns so the dataclass only carries fields the pipeline actually
reads back.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* drop private-API isinstance/shape asserts
These were build-mode guards on PackedSequence internals and shape
checks on tensors the pipeline itself constructs, both flagged as
noise in private code per reviewer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* build PackedSequence tensors on target device, drop to_cuda
Thread device through pack_input_sequence and _pack_*_tokens helpers
so all torch.tensor/zeros/arange calls land on the target device
directly. Move CPU-side mRoPE tensors over with .to(device) at the
append site. Pass device to finalize so list-to-tensor conversion lands
on device too. Delete PackedSequence.to_cuda() and the helper
_modality_to_cuda; drop the corresponding call sites in __call__.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* drop SequencePlan, skip_text_tokens, and bos_token_id branch
SequencePlan's has_text and has_vision were True at every construction
site and has_sound was derivable from x0_tokens_sound is not None.
condition_frame_indexes_vision is now passed directly as a List[int]
arg to pack_input_sequence. Removed the skip_text_tokens flag (never
True) and the dead bos_token_id shift branch in _pack_text_tokens
(special_tokens never carries one in this pipeline).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* use retrieve_latents helper in _encode_video
Copied from stable_diffusion_img2img matching cosmos2_5 convention so
make fix-copies keeps it synced. Functionally identical to the prior
.latent_dist.mode() call but handles latent_dist/latents attribute
variants uniformly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* drop get_data_and_condition + data_batch dict scaffolding
normalize_video_databatch_inplace, augment_image_dim_inplace and
remove_padding_from_latent were no-ops once is_preprocessed=True
(always set by this pipeline) and the pipeline never pads.
get_data_and_condition just orchestrated those plus a never-taken
multi-vision branch.
Replaced the whole chain with a few lines inline in prepare_latents:
build vision_tensor on device, call _encode_video, set fps_vision.
prepare_latents no longer needs input_caption_key, input_video_key,
input_image_key, or the prompt kwarg.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* drop _load_image_as_tensor; use VideoProcessor for conditioning frame
Image loading is the caller's responsibility (load_image from
diffusers.utils), matching the cosmos2_5 example. The pipeline registers
a VideoProcessor in __init__ and calls preprocess() to resize + normalize
caller-supplied PIL / np / tensor inputs to [1, 3, H, W] in [-1, 1].
prepare_latents fills the temporal dim in two lines (single frame at
t=0, repeat-pad the rest) preserving the prior i2v behavior.
Inference script updated to call load_image() before passing to the
pipeline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* move encode/decode helpers + transformer forward into Cosmos3OmniTransformer
Pull encode_text / encode_vision / encode_sound_tokens / decode_vision /
decode_sound_tokens (and their pure-tensor helpers patchify_and_pack_latents,
unpatchify_and_unpack_latents, apply_timestep_embeds_to_noisy_tokens,
_pack_sound_latents, _unpack_sound_latents) from the pipeline into
Cosmos3OmniTransformer as methods. The transformer's forward(packed_seq)
now runs the full per-step pass: encode text/vision/sound, rotary +
layer loop, decode vision/sound — returns (preds_vision, preds_sound).
The pipeline's CFG loop drops the encode_*/decode_* method calls and
the manual und/gen split/concat; each pass is now a single
self.transformer(packed_seq) call. No self.transformer.{embed_tokens,
vae2llm, llm2vae, sound2llm, llm2sound, time_embedder, time_proj,
sound_modality_embed} access remains in the pipeline.
Cosmos3VLTextModel is kept as a structural wrapper for now — flattening
it would break the published checkpoint layout. Tracked separately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* remove Cosmos3VLTextModel; flatten transformer layout
embed_tokens / layers / norm / norm_moe_gen / rotary_emb are now direct
attributes of Cosmos3OmniTransformer. The converter strips the leading
`model.` prefix from the source language_model state-dict so new
conversions land at the flat layout natively.
Published Hub artifact (nvidia/Cosmos3-Nano) needs its transformer
safetensors + index.json re-keyed with the same prefix strip before
this code can load it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
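The prefix strip described in the entry above can be sketched as follows. This is a minimal illustration, not the converter's actual code: `strip_prefix` is a hypothetical helper, the key names are made up, and integers stand in for tensors.

```python
def strip_prefix(state_dict, prefix="model."):
    """Return a copy of state_dict with one leading `prefix` removed from matching keys."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Illustrative keys only; integers stand in for tensors.
nested = {
    "model.embed_tokens.weight": 0,
    "model.layers.0.self_attn.q_proj.weight": 1,
    "model.norm.weight": 2,
}
flat = strip_prefix(nested)
print(sorted(flat))
```

Only the leading occurrence is removed, so a key that merely contains `model.` further along is left alone.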
* move save_img_or_video/save_wav to cosmos/export_utils.py
Per reviewer guidance, custom video/audio export helpers belong in a
pipeline-local export_utils.py (mirroring pipelines/ltx2/export_utils.py)
rather than living inside the pipeline file. Pipeline imports trim the
now-unused pathlib/numpy/export_to_video; inference example updated to
import from the new location.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* drop @torch.no_grad on pipeline __call__
Diffusers pipeline convention: __call__ does not wear a torch.no_grad
decorator; the responsibility for grad context sits with the caller.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* follow standard transformer conventions in Cosmos3OmniTransformer
- Declare _no_split_modules, _repeated_blocks, _skip_layerwise_casting_patterns,
_keep_in_fp32_modules, _supports_gradient_checkpointing on the transformer.
- Wire self.gradient_checkpointing + the _gradient_checkpointing_func branch in
forward so the flag is honest (models.md gotcha #3).
- Add PeftAdapterMixin and AttentionMixin to the mixin set so LoRA loading and
the attention-backend setters work.
- CosmosAttnProcessor3_0 now declares _attention_backend / _parallel_config and
forwards them to dispatch_attention_fn, matching the pattern in models.md
and transformer_wan.py.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* restore @torch.no_grad() on Cosmos3OmniDiffusersPipeline.__call__
Reverting bea4eecf9 — removing the decorator causes GPU OOM during inference
because the autograd graph accumulates across the full denoising loop (35
steps × dual cond/uncond passes × full transformer). pipelines.md gotcha #2
documents this exact failure mode and the convention is upheld by every
sibling pipeline (pipeline_flux.py:652, pipeline_qwenimage.py:462,
pipeline_wan.py:381, pipeline_ltx.py:535).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* drop training-only dummy projection paths in transformer decode helpers
_decode_vision and _decode_sound both guarded a "no noisy tokens" branch
that ran zeroed projections to keep the autograd graph intact. Those
branches only fire when a pure-conditioning step has no MSE-loss tokens,
which never happens in the inference pipeline — every workflow has at
least one noisy vision frame, and _decode_sound is gated on has_sound
which itself requires noisy sound tokens. Deleting per CLAUDE.md:
"delete training-time code paths… only keep the inference path."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* inline modality encode/decode helpers into transformer forward
models.md "Coding style": all layer calls should be visible directly in
forward — avoid helper functions that hide nn.Module calls. Inlines
_encode_text / _encode_vision / _encode_sound / _decode_vision /
_decode_sound into forward so embed_tokens, vae2llm, sound2llm, llm2vae,
llm2sound, and time_embedder are all visible at the call site. Pure-
tensor helpers (_patchify_and_pack_latents, _unpatchify_and_unpack_latents,
_pack_sound_latents, _unpack_sound_latents, _apply_timestep_embeds_to_noisy_tokens)
stay as methods since they don't hide layer state.
Also drops the inference-unreachable guards while collapsing the helpers:
the "vision is None" / "sound is None" / "mse_loss_indexes.numel() > 0"
branches never fire because the pipeline always packs vision, only routes
to the sound branch when sound is present, and condition_frame_indexes
never covers the entire stream.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* trim dead flags and idioms in Cosmos3 pipeline __call__
- Drop assert config.use_moe — use_moe is never read by the model and
asserts vanish under python -O; if !use_moe is unsupported the place
to surface it is check_inputs, not a stripped assert.
- Delete the joint_attn_implementation == "flex" path entirely (the
include_end_of_generation_token branch in pack_input_sequence, the
include_eog hoist, and both call-site kwargs). The published config
is "two_way"; the flex branch and the end_of_generation special token
were dead under every shipped checkpoint.
- Drop torch._inductor.cudagraph_mark_step_begin() from the step loop —
cudagraph stepping belongs in the caller's torch.compile wrapper, not
the inference pipeline.
- Replace four int(torch.prod(torch.tensor(shape))) idioms with
math.prod(shape) — no tensor allocation, no .item() sync, and math
is already imported.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
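Two of the cleanups in the entry above, sketched under assumptions: `num_elements` and `check_inputs` here are hypothetical stand-ins rather than the pipeline's own functions, and the error message is invented. The first shows `math.prod` replacing the tensor round-trip; the second shows an explicit `raise`, which still runs under `python -O` where an `assert` would be stripped.

```python
import math


def num_elements(shape):
    """Element count of a shape, with no tensor allocation or device sync."""
    return math.prod(shape)


def check_inputs(use_moe):
    """Hypothetical validation hook: raise explicitly instead of asserting."""
    if not use_moe:
        raise ValueError("use_moe=False is not supported by this pipeline.")


print(num_elements((16, 30, 40)))
check_inputs(use_moe=True)
```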
* introduce Cosmos3Condition for prepare_latents return shape
Replace the 7-tuple return from prepare_latents with a Cosmos3Condition
dataclass that carries the encoded conditioning latents (vision + optional
sound), their fps tensors, the conditioning frame indices, and the
num_vision_items count. The denoising loop and _postprocess_latents now
read these as named attributes instead of positional tuple unpacking.
Addresses reviewer thread huggingface/diffusers-new-model-addition-cosmos#1
comment 3278569263 ("create something like Cosmos3Condition class for
condition input").
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
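A rough sketch of the tuple-to-dataclass change described above. The field list follows the entry's description (vision and optional sound latents, fps values, conditioning frame indices, item count), but the exact field names, types, and defaults are assumptions, and `Any` stands in for tensor types.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Cosmos3Condition:
    # Field names are illustrative; the real class may differ.
    vision_latents: Any
    fps_vision: Any
    condition_frame_indexes_vision: list[int]
    num_vision_items: int = 1
    sound_latents: Optional[Any] = None
    fps_sound: Optional[Any] = None


cond = Cosmos3Condition(
    vision_latents="<vision latents>",
    fps_vision=24,
    condition_frame_indexes_vision=[0],
)
# Callers read named attributes instead of unpacking seven positions.
print(cond.num_vision_items, cond.sound_latents is None)
```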
* restructure pack helpers as data-returning pipeline methods
Replace the four module-level pack functions (_pack_text_tokens,
_pack_vision_tokens, _pack_sound_tokens, pack_input_sequence) with
methods on Cosmos3OmniDiffusersPipeline. The three per-segment methods
now build and return their own data (text_ids/mrope_ids tuple for text,
a populated ModalityData for vision/sound) instead of mutating a shared
PackedSequence builder; pack_input_sequence orchestrates them.
Other cleanups along the way:
- Drop dead branches that the published config never exercises:
use_mrope=False (model is always unified_3d_mrope), has_generation=False
(always True), multi-vision-items (num_vision_items always 1), and
the curr_rope_id non-mrope path.
- Move the bf16 cast of per-step noisy tokens to before pack_input_sequence
so the build-then-mutate pattern on packed_seq.vision.tokens disappears.
- Drop the latent_patch_size / config hoists in __call__ that are now
read directly inside the pack methods.
Addresses reviewer threads
huggingface/diffusers-new-model-addition-cosmos#1 comments 3278871807,
3278908766, 3278918514.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* collapse builder-pattern PackedSequence/ModalityData into flat dataclasses
ModalityData and PackedSequence carried a list-or-tensor union for every
field so they could double as builders during packing, with finalize()
converting the lists to tensors at the end. Now that the pack methods
each build their segment in one shot, finalize() is just a list->tensor
conversion the pack methods can do themselves.
- Rename ModalityData to _ModalityData (internal) with all-tensor fields
(lists only for per-item entries like tokens / condition_mask).
- Rename PackedSequence to Cosmos3PackedSequence and drop its finalize()
method; fields are direct tensors at construction time.
- _pack_vision_tokens / _pack_sound_tokens now build finalized
_ModalityData directly via torch.arange / torch.tensor / torch.full.
- pack_input_sequence builds the final Cosmos3PackedSequence in one
return statement; no more two-stage build-then-finalize.
- Update the transformer's forward docstring to reference the new name.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* exclude dtype from saved transformer config via ignore_for_config
ModelMixin.from_pretrained injects dtype into init_dict
(configuration_utils.py:289) whenever it appears in the loader's
unused_kwargs, so Cosmos3OmniTransformer.__init__ has to accept dtype.
But the default @register_to_config decorator was also serializing it
into config.json on every save — leaving a stray "dtype": "bfloat16"
key that doesn't describe the architecture, just the load-time runtime
preference.
Adding ignore_for_config = ["dtype"] keeps the decorator from
registering dtype while still accepting it on the init signature.
New saves omit dtype; existing checkpoints that have it log a
warning at load time but the value is re-injected and __init__
ignores it, so loading is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
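The mechanism in the entry above can be illustrated with a toy decorator. This is not the diffusers implementation: `register_to_config` below is a simplified stand-in written for this sketch, and `ToyTransformer` is hypothetical. It only shows how an `ignore_for_config` class attribute lets `__init__` accept an argument without recording it in the config.

```python
import functools
import inspect


def register_to_config(init):
    """Toy stand-in: record init arguments on self.config, minus ignored names."""

    @functools.wraps(init)
    def wrapper(self, *args, **kwargs):
        bound = inspect.signature(init).bind(self, *args, **kwargs)
        bound.apply_defaults()
        ignored = set(getattr(self, "ignore_for_config", []))
        self.config = {
            name: value
            for name, value in bound.arguments.items()
            if name != "self" and name not in ignored
        }
        init(self, *args, **kwargs)

    return wrapper


class ToyTransformer:
    ignore_for_config = ["dtype"]  # accepted by __init__, never serialized

    @register_to_config
    def __init__(self, hidden_size=64, num_layers=2, dtype=None):
        self.runtime_dtype = dtype


model = ToyTransformer(hidden_size=128, dtype="bfloat16")
print(model.config)
```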
* add Cosmos3 pipeline and transformer docs pages; retire JSON example inputs
- New docs/source/en/api/pipelines/cosmos3.md with copy-pasteable
text-to-image, text-to-video, image-to-video, and text-to-video-with-sound
example snippets. The docs page is now the canonical reference for
application code instead of the JSON-driven inputs/ directory.
- New docs/source/en/api/models/cosmos3_omni_transformer.md describing the
MoT dual-pathway architecture and showing a from_pretrained snippet.
- Wire both pages into docs/source/en/_toctree.yml.
- Export Cosmos3OmniDiffusersPipeline from the top-level diffusers package
(matching every sibling pipeline) and add the corresponding dummy class
for torch/transformers-unavailable environments.
- Delete examples/cosmos3/inputs/omni/{t2i,t2v,i2v}.json and rewrite
inference_cosmos3.py to take --prompt / --vision-path / --num-frames
directly as CLI args. The script stays as a development smoke-test
runner; canonical usage now lives in the docs.
- Refresh examples/cosmos3/README.md to point at the docs page and
reflect the new CLI surface.
Addresses reviewer thread
huggingface/diffusers-new-model-addition-cosmos#1 comment 3278258567.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
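A minimal sketch of the CLI surface named in the entry above. Only the three flag names come from the entry; the types, defaults, help text, and the sample prompt are assumptions, and the real inference_cosmos3.py may accept further arguments.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Cosmos 3 smoke-test runner (sketch)")
    parser.add_argument("--prompt", type=str, required=True, help="Text prompt")
    # Assumed optional: a conditioning image or video for image-to-video runs.
    parser.add_argument("--vision-path", type=str, default=None)
    # Assumed default of 1 frame, i.e. text-to-image.
    parser.add_argument("--num-frames", type=int, default=1)
    return parser


args = build_parser().parse_args(
    ["--prompt", "a robot arm stacking blocks", "--num-frames", "25"]
)
print(args.prompt, args.vision_path, args.num_frames)
```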
* restore full prompts in cosmos3 docs from original example JSONs
The text-to-video and image-to-video example prompts were shortened
when porting from examples/cosmos3/inputs/omni/*.json into the docs
page. Restore them verbatim from the JSONs so the docs reflect the
prompts the model was actually demonstrated against, and so users
copying from the docs get the same conditioning the example was tuned
for.
Also align the text-to-video-with-sound example: it now reuses the
exact same prompt as the text-to-video block with only enable_sound=True
added, instead of a hand-written waterfall prompt.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Renamed Cosmos3 module attributes
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* bugfix
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Removed unnecessary helper function; added extra comments
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Moved from encode_prompt to tokenize_prompt
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Bring back video system prompt
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Remove multi frame conditioning for now
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Removed default negative prompts from code
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Remove Cosmos3condition; simplify sequence pack
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Clean up multiple parameters
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Simplify decode video; remove remnants of batching
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* simple renames
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Refactored schedulers for sound
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Remove unnecessary autocast
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Update sound example
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Simplify loops in transformer_cosmos3.py
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Remove unused config attributes
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Cleanup audio decoder
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Reuse encoder_video from LTX2 for Cosmos3
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Fixed a few nits
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Moved to RMSNorm for Cosmos3
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Remove meta_tensor usage
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Improved rope handling
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Improved prompt templates
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Added extra docs for template
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* remove dataclasses
* Cleanup after merging
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Added guardrails
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* bugfixed guardrails
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Bugfix guardrails v2
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Simplified input_timestep
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Add TODO
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* simplify conditional mask generation
* Inlined _postprocess_latents
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* removed pack_input_sequence helper
* restore export utils with deprecation warning
* moved sampling rate to pipeline attribute
* inlined sound and image condition mask
* separating static and timestep-based sound and vision token packing
* unpack transformer args
* enabled selection of cosmos3 super
* Update src/diffusers/pipelines/cosmos/pipeline_cosmos3_omni.py
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* Update src/diffusers/pipelines/cosmos/pipeline_cosmos3_omni.py
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* Apply suggestions from code review
Co-authored-by: YiYi Xu <yixu310@gmail.com>
* fixed sound and vision conditioning use from prepare latents
* ruff format and doc builder
* ran fix copies
* move typing to python3.10
* move special token application from pack_text_tokens to tokenize_prompt
* rename packing methods to process methods
* remove guidance_scale check
* fix nits
* respect vae dtype in the pipeline
* use vae dtype for vae normalization stats
* skip CFG if guidance_scale is 1
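Why the skip above is safe, sketched with the common classifier-free guidance combination `uncond + scale * (cond - uncond)`; this formula is assumed here rather than taken from the pipeline. At a scale of 1 it reduces to the conditional prediction, so the unconditional pass can be dropped. `denoise_step` and `predict` are hypothetical names, and floats stand in for tensors.

```python
def denoise_step(predict, latents, guidance_scale):
    """Run one guided prediction, skipping the unconditional pass when scale is 1."""
    cond = predict(latents, conditional=True)
    if guidance_scale == 1.0:
        return cond  # uncond + 1 * (cond - uncond) == cond
    uncond = predict(latents, conditional=False)
    return uncond + guidance_scale * (cond - uncond)


calls = []


def predict(latents, conditional):
    # Toy model: records each call and returns a float in place of a tensor.
    calls.append(conditional)
    return latents * (2.0 if conditional else 0.5)


print(denoise_step(predict, 1.0, guidance_scale=1.0), len(calls))
```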
* Remove unnecessary parameter
Signed-off-by: Maciej Bala <mbala@nvidia.com>
* fix CFG for sound
* bugfix for sound CFG
* ruff format
* Fix apply_chat_template return dict arg to return BatchEncoding
* added option to select attention processor
* docs: refresh Cosmos 3 pipeline intro
Replace the terse architectural lede with the launch-style positioning
(unified WFM for Physical AI, consolidating Predict/Reason/Transfer
into one omni-model) and split out "What's new" and "Available
checkpoints" sections so the page leads with capability rather than
repo IDs.
* docs: document Cosmos3OmniPipeline.__call__ arguments
Add the missing Args block to Cosmos3OmniPipeline.__call__ so
utils/check_forward_call_docstrings.py passes — covers all 21
parameters from prompt through enable_safety_check, plus a Returns
section describing the output dataclass.
* style: doc-builder reflow on Cosmos3OmniPipeline.__call__ docstring
---------
Signed-off-by: Maciej Bala <mbala@nvidia.com>
Co-authored-by: Yuliya Zhautouskaya <yzhautouskay@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Maciej Bala <mbala@nvidia.com>
Co-authored-by: Dima Zhylko <dzhylko@nvidia.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>