diffusers
c3c9555d - [LTX-2.3] Fix cross-attn timestep, audio format, update parity-testing skill

Commit
4 days ago
[LTX-2.3] Fix cross-attn timestep, audio format, update parity-testing skill

Model fixes:
- Cross-attention timestep: always use cross-modality sigma instead of conditional on use_cross_timestep (matching reference preprocessor which always uses cross_modality.sigma)
- This was the root cause of the remaining 3.56 pixel diff: the diffusers model used timestep.flatten() (2304 per-token values) instead of audio_sigma.flatten() (1 scalar) for cross-attention modulation

Pipeline fixes:
- Per-token timestep shape (B,S) instead of (B,) for main time_embed
- f32 sigma for prompt_adaln (not bf16)
- Audio decoder: .squeeze(0).float() to match reference output format

Parity-testing skill updates:
- Add Phase 2 (optional GPU/bf16) with same capture-inject methodology
- Add 9 new pitfalls (#19-#27) from bf16 debugging
- Decode test now includes final output format (encode_video, audio)
- Add model interface mapping as required artifact from component tests
- Add test directory + lab_book setup questions
- Add example test script templates

Result: diffusers pipeline produces pixel-identical video (0.0 diff) and bit-identical audio waveform vs reference pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
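The two result claims above ("0.0 diff" and "bit-identical") are different kinds of check, and a minimal sketch may help show the distinction. This is not the commit's test code: the helper names `max_abs_diff` and `bit_identical` are hypothetical, the inputs are plain Python sequences rather than tensors, and it assumes the pixel diff is a maximum absolute difference, which the commit message does not state.

```python
import struct


def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two sequences.

    Assumption: the "pixel diff" is a max-abs metric; the commit message
    does not say whether 3.56 was a max or a mean.
    """
    if len(a) != len(b):
        raise ValueError("length mismatch")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def bit_identical(a, b):
    """True if both sequences have the same float32 bit patterns."""
    if len(a) != len(b):
        return False

    def pack(xs):
        # Little-endian float32, one value per sample.
        return struct.pack(f"<{len(xs)}f", *xs)

    return pack(a) == pack(b)


# Hypothetical pixel values from a reference frame and a candidate frame.
reference_pixels = [0, 128, 255]
candidate_pixels = [0, 130, 255]
print(max_abs_diff(reference_pixels, candidate_pixels))

# A zero numeric diff does not imply identical bits: 0.0 and -0.0 compare equal.
print(max_abs_diff([0.0], [-0.0]), bit_identical([0.0], [-0.0]))
```

Bitwise comparison is the stricter of the two: it also distinguishes values that compare equal numerically, such as 0.0 and -0.0.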
Author
yiyi@huggingface.co