diffusers
7d5767db - [LTX-2.3] Add multi-modal guidance via custom guider with native transformer kwargs

Commit
4 days ago
Add LTX2MultiModalGuidance guider that handles all 4 guidance types for LTX-2.3 audiovisual generation (CFG, STG, modality isolation, rescale) with separate video/audio scales. The guider passes per-batch transformer kwargs via _model_kwargs, keeping the denoise loop fully generic.

Key changes:
- New LTX2MultiModalGuidance guider (inherits BaseGuidance, not SkipLayerGuidance) with native transformer kwargs (spatio_temporal_guidance_blocks, isolate_modalities) instead of hooks
- Denoise loop is now generic: no model-specific guidance code, it just runs guider passes and calls guider() for the combination formula
- Separate video/audio guidance scales (video cfg=3.0, audio cfg=7.0 by default)
- Audio sample rate exposed from audio decoder for correct MP4 encoding
- Connector processes positive/negative prompts separately (batch=1 each) to match the reference, because batched processing produced different self-attention results
- Removed unused guiders from LTX2TextEncoderStep and LTX2ConnectorStep
- Fixed SkipLayerGuidance._is_slg_enabled step range (< to <=)
- Fixed sigma tensor device placement for GPU models
- Updated parity-testing skill with cross-contamination rules and new pitfalls

Verified pixel-identical with reference at 960x544x241, 30 steps, full guidance (CFG=3.0, STG=1.0, blocks=[28], modality=3.0, rescale=0.7, audio_cfg=7.0).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
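The commit message names four guidance terms (CFG, STG, modality isolation, rescale) but does not spell out the combination formula. The sketch below shows one plausible way such terms could be combined. It is an illustration under stated assumptions, not the code in this commit: the function name combine_guidance, its parameters, the use of flat lists of floats, and the exact formula are all hypothetical. Only the default scale values are taken from the verification settings quoted above.

```python
from statistics import pstdev


def combine_guidance(cond, uncond, perturbed, isolated,
                     cfg_scale=3.0, stg_scale=1.0,
                     modality_scale=3.0, rescale=0.7):
    """Hypothetical combination of four model predictions.

    Each input is a flat list of floats standing in for a tensor:
      cond      - prediction with the positive prompt
      uncond    - prediction with the negative prompt (CFG pass)
      perturbed - prediction with selected blocks skipped (STG pass)
      isolated  - prediction with modalities isolated
    The formula is an assumption, not taken from the commit.
    """
    guided = [
        c
        + (cfg_scale - 1.0) * (c - u)       # CFG: move away from the negative prompt
        + stg_scale * (c - p)               # STG: move away from the block-skipped pass
        + (modality_scale - 1.0) * (c - i)  # modality: move away from the isolated pass
        for c, u, p, i in zip(cond, uncond, perturbed, isolated)
    ]

    # Rescale: blend the guided prediction's standard deviation back
    # toward that of the conditional prediction to limit over-saturation.
    std_guided = pstdev(guided)
    if rescale > 0.0 and std_guided > 0.0:
        factor = rescale * (pstdev(cond) / std_guided) + (1.0 - rescale)
        guided = [g * factor for g in guided]
    return guided
```

Under this sketch, separate video and audio scales would simply mean calling the function once per modality with different arguments, for example cfg_scale=3.0 for the video latents and cfg_scale=7.0 for the audio latents.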
Author
yiyi@huggingface.co