[LTX-2.3] Add multi-modal guidance via custom guider with native transformer kwargs

Add LTX2MultiModalGuidance guider that handles all 4 guidance types for LTX-2.3
audiovisual generation (CFG, STG, modality isolation, rescale) with separate
video/audio scales. The guider passes per-batch transformer kwargs via _model_kwargs,
keeping the denoise loop fully generic.
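The contract described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual implementation: the class and attribute names other than LTX2MultiModalGuidance, _model_kwargs, spatio_temporal_guidance_blocks, and isolate_modalities (which come from this commit) are hypothetical.

```python
# Hedged sketch: a guider yields a list of forward passes, each carrying its
# own transformer kwargs, so the denoise loop stays model-agnostic.
from dataclasses import dataclass, field


@dataclass
class GuiderPass:
    """One transformer forward pass requested by the guider."""
    name: str
    # Extra per-batch kwargs forwarded verbatim to the transformer
    # (the commit calls this mechanism _model_kwargs).
    model_kwargs: dict = field(default_factory=dict)


class LTX2MultiModalGuidance:
    """Illustrative shape of the guider; real signatures may differ."""

    def __init__(self, video_scale=3.0, audio_scale=7.0, stg_blocks=(28,)):
        self.video_scale = video_scale
        self.audio_scale = audio_scale
        self.stg_blocks = list(stg_blocks)

    def passes(self):
        # Native transformer kwargs replace forward hooks: the STG and
        # modality-isolation passes just ship different kwargs.
        return [
            GuiderPass("cond"),
            GuiderPass("uncond"),
            GuiderPass("stg", {"spatio_temporal_guidance_blocks": self.stg_blocks}),
            GuiderPass("isolated", {"isolate_modalities": True}),
        ]
```

The denoise loop can then iterate over passes(), run the transformer once per pass with that pass's model_kwargs, and hand all predictions back to the guider for combination.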

Key changes:
- New LTX2MultiModalGuidance guider (inherits BaseGuidance, not SkipLayerGuidance)
with native transformer kwargs (spatio_temporal_guidance_blocks, isolate_modalities)
instead of hooks
- Denoise loop is now generic — no model-specific guidance code, just runs guider passes
and calls guider() for the combination formula
- Separate video/audio guidance scales (video cfg=3.0, audio cfg=7.0 by default)
- Audio sample rate exposed from audio decoder for correct MP4 encoding
- Connector processes positive/negative prompts separately (batch=1 each) to match
reference — batched processing produced different self-attention results
- Removed unused guiders from LTX2TextEncoderStep and LTX2ConnectorStep
- Fixed SkipLayerGuidance._is_slg_enabled step range (< to <=)
- Fixed sigma tensor device placement for GPU models
- Updated parity-testing skill with cross-contamination rules and new pitfalls
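For the combination step, a minimal sketch of CFG with the standard std-matching guidance rescale, applied with separate per-modality scales. This follows the commonly used rescale formula; the helper name and plain-list arithmetic are illustrative, and the real code likely operates on tensors.

```python
# Hedged sketch: classifier-free guidance plus guidance rescale on flat
# lists of floats, using only the standard library.
import statistics


def cfg_with_rescale(cond, uncond, scale, rescale=0.7):
    """guided = uncond + scale * (cond - uncond), then std-match to cond."""
    guided = [u + scale * (c - u) for c, u in zip(cond, uncond)]
    std_cond = statistics.pstdev(cond)
    std_guided = statistics.pstdev(guided)
    if std_guided == 0:
        return guided
    # Rescale the guided prediction so its std matches the cond prediction,
    # then blend with the raw guided prediction by the rescale factor.
    rescaled = [g * (std_cond / std_guided) for g in guided]
    return [rescale * r + (1 - rescale) * g for r, g in zip(rescaled, guided)]


# Separate per-modality scales, matching this commit's defaults.
video_pred = cfg_with_rescale([2.0, 5.0, 9.0], [1.0, 4.0, 7.0], scale=3.0)
audio_pred = cfg_with_rescale([3.0, 6.0], [2.0, 5.0], scale=7.0)
```

With scale=1.0 and rescale=0.0 the function degenerates to the conditional prediction, which is a quick sanity check on the formula.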
Verified pixel-identical against the reference at 960x544x241, 30 steps, full guidance
(CFG=3.0, STG=1.0, blocks=[28], modality=3.0, rescale=0.7, audio_cfg=7.0).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Author: yiyi@huggingface.co