Adding Cosmos 3 to Diffusers (#13818)

Commit

37 days ago

Adding Cosmos 3 to Diffusers (#13818)

* Adding Cosmos 3
* removed dead code
* Change custom TimeEmbedding Layer to Diffusers Time Embedding
* removed dependency on Hugging Face transformers
* refactor 1
* Fixed Attention Pattern
* Removed from Pretrain overrides
* Removing normalization from the audio Tokenizer
* fixed diffusers checkpoint
* fixed video save uint conversion
* added forward hook for cpu offload case
* removed dead params for sound tokenizer
* renaming audio encoder for readability
* ruff format
* Fix checkpoint conversion script for sound tokenizer
* Audio Decoder trim and removing some dead code
* removing dead sequence packing code
* refactor pipeline to diffusers style formatting
* removing use of cosmos3 audio encoder
* Revert "removing use of cosmos3 audio encoder"
  This reverts commit 1b8b99a95e5acc749b5c100637ed382de138c606.
* refactor audio encoder
* inline remaining sequence packing functions and lint
* Removed GenerationDataClean class and Action logic
* inlined default args
* removed dead code and refactoring
* drop pipeline-helper @no_grad, inline derive helper, move guidance check to check_inputs
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* drop unused list bookkeeping from PackedSequence
  attn_modes was never read; sample_lens collapses into the existing sequence_length int (we only pack a single sample at a time); split_lens collapses into a single und_len int (only split_lens[0] was ever read).
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* extract pack-time build state from PackedSequence dataclass
  curr, _use_mrope, _mrope_temporal_offset, _mrope_reset_spatial were transient counters used while building the joint sequence, not part of the finalized output. Thread them through _pack_*_tokens as positional args/returns so the dataclass only carries fields the pipeline actually reads back.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* drop private-API isinstance/shape asserts
  These were build-mode guards on PackedSequence internals and shape checks on tensors the pipeline itself constructs, both flagged as noise in private code per reviewer.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* build PackedSequence tensors on target device, drop to_cuda
  Thread device through pack_input_sequence and _pack_*_tokens helpers so all torch.tensor/zeros/arange calls land on the target device directly. Move CPU-side mRoPE tensors over with .to(device) at the append site. Pass device to finalize so list-to-tensor conversion lands on device too. Delete PackedSequence.to_cuda() and the helper _modality_to_cuda; drop the corresponding call sites in __call__.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* drop SequencePlan, skip_text_tokens, and bos_token_id branch
  SequencePlan's has_text and has_vision were True at every construction site and has_sound was derivable from x0_tokens_sound is not None. condition_frame_indexes_vision is now passed directly as a List[int] arg to pack_input_sequence. Removed the skip_text_tokens flag (never True) and the dead bos_token_id shift branch in _pack_text_tokens (special_tokens never carries one in this pipeline).
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* use retrieve_latents helper in _encode_video
  Copied from stable_diffusion_img2img matching cosmos2_5 convention so make fix-copies keeps it synced. Functionally identical to the prior .latent_dist.mode() call but handles latent_dist/latents attribute variants uniformly.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* drop get_data_and_condition + data_batch dict scaffolding
  normalize_video_databatch_inplace, augment_image_dim_inplace and remove_padding_from_latent were no-ops once is_preprocessed=True (always set by this pipeline) and the pipeline never pads. get_data_and_condition just orchestrated those plus a never-taken multi-vision branch. Replaced the whole chain with a few lines inline in prepare_latents: build vision_tensor on device, call _encode_video, set fps_vision. prepare_latents no longer needs input_caption_key, input_video_key, input_image_key, or the prompt kwarg.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* drop _load_image_as_tensor; use VideoProcessor for conditioning frame
  Image loading is the caller's responsibility (load_image from diffusers.utils), matching the cosmos2_5 example. The pipeline registers a VideoProcessor in __init__ and calls preprocess() to resize + normalize caller-supplied PIL / np / tensor inputs to [1, 3, H, W] in [-1, 1]. prepare_latents fills the temporal dim in two lines (single frame at t=0, repeat-pad the rest) preserving the prior i2v behavior. Inference script updated to call load_image() before passing to the pipeline.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* move encode/decode helpers + transformer forward into Cosmos3OmniTransformer
  Pull encode_text / encode_vision / encode_sound_tokens / decode_vision / decode_sound_tokens (and their pure-tensor helpers patchify_and_pack_latents, unpatchify_and_unpack_latents, apply_timestep_embeds_to_noisy_tokens, _pack_sound_latents, _unpack_sound_latents) from the pipeline into Cosmos3OmniTransformer as methods. The transformer's forward(packed_seq) now runs the full per-step pass: encode text/vision/sound, rotary + layer loop, decode vision/sound, and returns (preds_vision, preds_sound).
  The pipeline's CFG loop drops the encode_*/decode_* method calls and the manual und/gen split/concat; each pass is now a single self.transformer(packed_seq) call. No self.transformer.{embed_tokens, vae2llm, llm2vae, sound2llm, llm2sound, time_embedder, time_proj, sound_modality_embed} access remains in the pipeline. Cosmos3VLTextModel is kept as a structural wrapper for now; flattening it would break the published checkpoint layout. Tracked separately.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* remove Cosmos3VLTextModel; flatten transformer layout
  embed_tokens / layers / norm / norm_moe_gen / rotary_emb are now direct attributes of Cosmos3OmniTransformer. The converter strips the leading `model.` prefix from the source language_model state-dict so new conversions land at the flat layout natively. Published Hub artifact (nvidia/Cosmos3-Nano) needs its transformer safetensors + index.json re-keyed with the same prefix strip before this code can load it.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* move save_img_or_video/save_wav to cosmos/export_utils.py
  Per reviewer guidance, custom video/audio export helpers belong in a pipeline-local export_utils.py (mirroring pipelines/ltx2/export_utils.py) rather than living inside the pipeline file. Pipeline imports trim the now-unused pathlib/numpy/export_to_video; inference example updated to import from the new location.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* drop @torch.no_grad on pipeline __call__
  Diffusers pipeline convention: __call__ does not wear a torch.no_grad decorator; the responsibility for grad context sits with the caller.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* follow standard transformer conventions in Cosmos3OmniTransformer
  - Declare _no_split_modules, _repeated_blocks, _skip_layerwise_casting_patterns, _keep_in_fp32_modules, _supports_gradient_checkpointing on the transformer.
  - Wire self.gradient_checkpointing + the _gradient_checkpointing_func branch in forward so the flag is honest (models.md gotcha #3).
  - Add PeftAdapterMixin and AttentionMixin to the mixin set so LoRA loading and the attention-backend setters work.
  - CosmosAttnProcessor3_0 now declares _attention_backend / _parallel_config and forwards them to dispatch_attention_fn, matching the pattern in models.md and transformer_wan.py.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* restore @torch.no_grad() on Cosmos3OmniDiffusersPipeline.__call__
  Reverting bea4eecf9: removing the decorator causes GPU OOM during inference because the autograd graph accumulates across the full denoising loop (35 steps × dual cond/uncond passes × full transformer). pipelines.md gotcha #2 documents this exact failure mode and the convention is upheld by every sibling pipeline (pipeline_flux.py:652, pipeline_qwenimage.py:462, pipeline_wan.py:381, pipeline_ltx.py:535).
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* drop training-only dummy projection paths in transformer decode helpers
  _decode_vision and _decode_sound both guarded a "no noisy tokens" branch that ran zeroed projections to keep the autograd graph intact. Those branches only fire when a pure-conditioning step has no MSE-loss tokens, which never happens in the inference pipeline: every workflow has at least one noisy vision frame, and _decode_sound is gated on has_sound which itself requires noisy sound tokens. Deleting per CLAUDE.md: "delete training-time code paths… only keep the inference path."
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* inline modality encode/decode helpers into transformer forward
  models.md "Coding style": all layer calls should be visible directly in forward; avoid helper functions that hide nn.Module calls.
  Inlines _encode_text / _encode_vision / _encode_sound / _decode_vision / _decode_sound into forward so embed_tokens, vae2llm, sound2llm, llm2vae, llm2sound, and time_embedder are all visible at the call site. Pure-tensor helpers (_patchify_and_pack_latents, _unpatchify_and_unpack_latents, _pack_sound_latents, _unpack_sound_latents, _apply_timestep_embeds_to_noisy_tokens) stay as methods since they don't hide layer state.
  Also drops the inference-unreachable guards while collapsing the helpers: the "vision is None" / "sound is None" / "mse_loss_indexes.numel() > 0" branches never fire because the pipeline always packs vision, only routes to the sound branch when sound is present, and condition_frame_indexes never covers the entire stream.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* trim dead flags and idioms in Cosmos3 pipeline __call__
  - Drop assert config.use_moe: use_moe is never read by the model and asserts vanish under python -O; if !use_moe is unsupported the place to surface it is check_inputs, not a stripped assert.
  - Delete the joint_attn_implementation == "flex" path entirely (the include_end_of_generation_token branch in pack_input_sequence, the include_eog hoist, and both call-site kwargs). The published config is "two_way"; the flex branch and the end_of_generation special token were dead under every shipped checkpoint.
  - Drop torch._inductor.cudagraph_mark_step_begin() from the step loop; cudagraph stepping belongs in the caller's torch.compile wrapper, not the inference pipeline.
  - Replace four int(torch.prod(torch.tensor(shape))) idioms with math.prod(shape): no tensor allocation, no .item() sync, and math is already imported.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* introduce Cosmos3Condition for prepare_latents return shape
  Replace the 7-tuple return from prepare_latents with a Cosmos3Condition dataclass that carries the encoded conditioning latents (vision + optional sound), their fps tensors, the conditioning frame indices, and the num_vision_items count. The denoising loop and _postprocess_latents now read these as named attributes instead of positional tuple unpacking.
  Addresses reviewer thread huggingface/diffusers-new-model-addition-cosmos#1 comment 3278569263 ("create something like Cosmos3Condition class for condition input").
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* restructure pack helpers as data-returning pipeline methods
  Replace the four module-level pack functions (_pack_text_tokens, _pack_vision_tokens, _pack_sound_tokens, pack_input_sequence) with methods on Cosmos3OmniDiffusersPipeline. The three per-segment methods now build and return their own data (text_ids/mrope_ids tuple for text, a populated ModalityData for vision/sound) instead of mutating a shared PackedSequence builder; pack_input_sequence orchestrates them.
  Other cleanups along the way:
  - Drop dead branches that the published config never exercises: use_mrope=False (model is always unified_3d_mrope), has_generation=False (always True), multi-vision-items (num_vision_items always 1), and the curr_rope_id non-mrope path.
  - Move the bf16 cast of per-step noisy tokens to before pack_input_sequence so the build-then-mutate pattern on packed_seq.vision.tokens disappears.
  - Drop the latent_patch_size / config hoists in __call__ that are now read directly inside the pack methods.
  Addresses reviewer threads huggingface/diffusers-new-model-addition-cosmos#1 comments 3278871807, 3278908766, 3278918514.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* collapse builder-pattern PackedSequence/ModalityData into flat dataclasses
  ModalityData and PackedSequence carried a list-or-tensor union for every field so they could double as builders during packing, with finalize() converting the lists to tensors at the end. Now that the pack methods each build their segment in one shot, finalize() is just a list->tensor conversion the pack methods can do themselves.
  - Rename ModalityData to _ModalityData (internal) with all-tensor fields (lists only for per-item entries like tokens / condition_mask).
  - Rename PackedSequence to Cosmos3PackedSequence and drop its finalize() method; fields are direct tensors at construction time.
  - _pack_vision_tokens / _pack_sound_tokens now build finalized _ModalityData directly via torch.arange / torch.tensor / torch.full.
  - pack_input_sequence builds the final Cosmos3PackedSequence in one return statement; no more two-stage build-then-finalize.
  - Update the transformer's forward docstring to reference the new name.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* exclude dtype from saved transformer config via ignore_for_config
  ModelMixin.from_pretrained injects dtype into init_dict (configuration_utils.py:289) whenever it appears in the loader's unused_kwargs, so Cosmos3OmniTransformer.__init__ has to accept dtype. But the default @register_to_config decorator was also serializing it into config.json on every save, leaving a stray "dtype": "bfloat16" key that doesn't describe the architecture, just the load-time runtime preference.
  Adding ignore_for_config = ["dtype"] keeps the decorator from registering dtype while still accepting it on the init signature. New saves omit dtype; existing checkpoints that have it log a warning at load time but the value is re-injected and __init__ ignores it, so loading is unchanged.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* add Cosmos3 pipeline and transformer docs pages; retire JSON example inputs
  - New docs/source/en/api/pipelines/cosmos3.md with copy-pasteable text-to-image, text-to-video, image-to-video, and text-to-video-with-sound example snippets. The docs page is now the canonical reference for application code instead of the JSON-driven inputs/ directory.
  - New docs/source/en/api/models/cosmos3_omni_transformer.md describing the MoT dual-pathway architecture and showing a from_pretrained snippet.
  - Wire both pages into docs/source/en/_toctree.yml.
  - Export Cosmos3OmniDiffusersPipeline from the top-level diffusers package (matching every sibling pipeline) and add the corresponding dummy class for torch/transformers-unavailable environments.
  - Delete examples/cosmos3/inputs/omni/{t2i,t2v,i2v}.json and rewrite inference_cosmos3.py to take --prompt / --vision-path / --num-frames directly as CLI args. The script stays as a development smoke-test runner; canonical usage now lives in the docs.
  - Refresh examples/cosmos3/README.md to point at the docs page and reflect the new CLI surface.
  Addresses reviewer thread huggingface/diffusers-new-model-addition-cosmos#1 comment 3278258567.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* restore full prompts in cosmos3 docs from original example JSONs
  The text-to-video and image-to-video example prompts were shortened when porting from examples/cosmos3/inputs/omni/*.json into the docs page. Restore them verbatim from the JSONs so the docs reflect the prompts the model was actually demonstrated against, and so users copying from the docs get the same conditioning the example was tuned for.
  Also align the text-to-video-with-sound example: it now reuses the exact same prompt as the text-to-video block with only enable_sound=True added, instead of a hand-written waterfall prompt.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Renamed Cosmos3 module attributes
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* bugfix
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Removed unnecessary helper function; added extra comments
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Moved from encode_prompt to tokenize_prompt
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Bring back video system prompt
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Remove multi frame conditioning for now
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Removed default negative prompts from code
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Remove Cosmos3condition; simplify sequence pack
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Clean up multiple parameters
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Simplify decode video; remove remnants of batching
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* simple renames
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Refactored schedulers for sound
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Remove unnecessary autocast
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Update sound example
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Simplify loops in transformer_cosmos3.py
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Remove unused config attributes
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Cleanup audio decoder
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Reuse encoder_video from LTX2 for Cosmos3
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Fixed a few nits
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Moved to RMSNorm for Cosmos3
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Remove meta_tensor usage
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Improved rope handling
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Improved prompt templates
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Added extra docs for template
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* remove dataclasses
* Cleanup after merging
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Added guardrails
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* bugfixed guardrails
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Bugfix guardrails v2
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Simplified input_timestep
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* Add TODO
  Co-authored-by: YiYi Xu <yixu310@gmail.com>
* simplify conditional mask generation
* Inlined _postprocess_latents
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* removed pack_input_sequence helper
* restore export utils with deprecation warning
* moved sampling rate to pipeline attribute
* inlined sound and image condition mask
* separating static and timestep based sound and vision token packing
* unpack transformer args
* enabled selection of cosmos3 super
* Update src/diffusers/pipelines/cosmos/pipeline_cosmos3_omni.py
  Co-authored-by: YiYi Xu <yixu310@gmail.com>
* Update src/diffusers/pipelines/cosmos/pipeline_cosmos3_omni.py
  Co-authored-by: YiYi Xu <yixu310@gmail.com>
* Apply suggestions from code review
  Co-authored-by: YiYi Xu <yixu310@gmail.com>
* fixed sound and vision conditioning use from prepare latents
* ruff format and doc builder
* ran fix copies
* move typing to python3.10
* move special token application from pack_text_tokens to tokenize_prompt
* rename packing methods to process methods
* remove guidance_scale check
* fix nits
* respect vae dtype in the pipeline
* use vae dtype for vae normalization stats
* skip CFG if guidance_scale is 1
* Remove unnecessary parameter
  Signed-off-by: Maciej Bala <mbala@nvidia.com>
* fix CFG for sound
* bugfix for sound CFG
* ruff format
* Fix apply_chat_template return dict arg to return BatchEncoding
* added option to select attention processor
* docs: refresh Cosmos 3 pipeline intro
  Replace the terse architectural lede with the launch-style positioning (unified WFM for Physical AI, consolidating Predict/Reason/Transfer into one omni-model) and split out "What's new" and "Available checkpoints" sections so the page leads with capability rather than repo IDs.
* docs: document Cosmos3OmniPipeline.__call__ arguments
  Add the missing Args block to Cosmos3OmniPipeline.__call__ so utils/check_forward_call_docstrings.py passes; covers all 21 parameters from prompt through enable_safety_check, plus a Returns section describing the output dataclass.
* style: doc-builder reflow on Cosmos3OmniPipeline.__call__ docstring

---------

Signed-off-by: Maciej Bala <mbala@nvidia.com>
Co-authored-by: Yuliya Zhautouskaya <yzhautouskay@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Maciej Bala <mbala@nvidia.com>
Co-authored-by: Dima Zhylko <dzhylko@nvidia.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
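
The `math.prod(shape)` replacement mentioned under "trim dead flags and idioms" can be illustrated with a minimal sketch. The helper name `num_elements` and the example shape are hypothetical and do not come from the pipeline; only the two idioms themselves are taken from the commit message.

```python
import math


def num_elements(shape):
    """Count the elements of a tensor shape with pure-Python arithmetic.

    Hypothetical helper for illustration only. The idiom the commit
    message says was replaced, int(torch.prod(torch.tensor(shape))),
    built a tensor just to multiply a handful of integers.
    """
    # math.prod multiplies the ints directly and returns 1 for an empty shape.
    return math.prod(shape)


if __name__ == "__main__":
    latent_shape = (16, 8, 44, 80)  # assumed example shape, not from the source
    print(num_elements(latent_shape))
```

Because the result is a plain Python `int`, it can be used directly for reshapes or slicing without a tensor round-trip.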

References

#13818 - Adding Cosmos 3 to Diffusers

Author

atharvajoshi10

Parents

ff3b86b4

diffusers a1c7df48 - Adding Cosmos 3 to Diffusers (#13818)
