[LTX-2.3] Fix cross-attn timestep, audio format, update parity-testing skill
Model fixes:
- Cross-attention timestep: always use the cross-modality sigma rather than
  gating it on use_cross_timestep (the reference preprocessor always uses
  cross_modality.sigma)
- This was the root cause of the remaining 3.56 pixel diff: the diffusers
  model used timestep.flatten() (2304 per-token values) instead of
  audio_sigma.flatten() (1 scalar) for cross-attention modulation
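
The cross-attention change can be pictured with a minimal sketch that uses plain Python lists in place of tensors. The helper names and the 0.5 values are hypothetical; only use_cross_timestep, the 2304 per-token count, and the single scalar sigma come from the notes above.

```python
def flatten(nested):
    """Flatten one level of nesting, like tensor.flatten() on a 2-D tensor."""
    return [v for row in nested for v in row]


def cross_attn_timestep_before(timestep, cross_sigma, use_cross_timestep):
    # Old behaviour (as described above): the cross-modality sigma was used
    # only when the flag was set; otherwise the per-token timestep was used.
    return flatten(cross_sigma) if use_cross_timestep else flatten(timestep)


def cross_attn_timestep_after(timestep, cross_sigma):
    # New behaviour: always modulate cross-attention with the other
    # modality's sigma, regardless of any flag.
    return flatten(cross_sigma)


# Hypothetical shapes: batch of 1, 2304 video tokens, one audio sigma.
timestep = [[0.5] * 2304]  # (B, S) per-token timesteps
audio_sigma = [[0.5]]      # (B, 1) one sigma per batch item

before = cross_attn_timestep_before(timestep, audio_sigma, use_cross_timestep=False)
after = cross_attn_timestep_after(timestep, audio_sigma)
print(len(before), len(after))
```

With the flag unset, the old path hands 2304 values to the modulation while the new path always hands it one.
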
Pipeline fixes:
- Pass the per-token timestep with shape (B,S) instead of (B,) to the main
  time_embed
- Keep sigma in f32 (not bf16) for prompt_adaln
- Audio decoder: apply .squeeze(0).float() to match the reference output format
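
To illustrate why the sigma dtype matters, here is a minimal standard-library sketch that rounds a float32 value to bf16 precision. The sigma value 0.7373 and the helper names are hypothetical and chosen only for illustration; this is not the pipeline's own code.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (f64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_to_bf16(x: float) -> float:
    """Round a finite value to bf16 (round-to-nearest-even on the top 16 bits).

    NaN and overflow handling are deliberately left out of this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add half of the discarded range, plus the tie-break bit for "even".
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


sigma_f32 = to_f32(0.7373)           # hypothetical sigma kept in f32
sigma_bf16 = round_to_bf16(0.7373)   # the same sigma after a bf16 cast

print(sigma_bf16)  # 0.73828125
print(abs(sigma_bf16 - sigma_f32))
```

bf16 keeps only 8 significant bits, so values in [0.5, 1) snap to multiples of 1/256. An error of that size in a modulation input can be enough to break bit-level parity.
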
Parity-testing skill updates:
- Add Phase 2 (optional GPU/bf16) with same capture-inject methodology
- Add 9 new pitfalls (#19-#27) from bf16 debugging
- Decode test now includes final output format (encode_video, audio)
- Add model interface mapping as required artifact from component tests
- Add test directory + lab_book setup questions
- Add example test script templates
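
One way to read the capture-inject methodology is sketched below: record each reference stage's input and output, then feed the recorded input into the matching candidate stage so that errors do not compound across stages. This is an assumption about the approach, not the skill's actual templates; the stage names and functions are hypothetical.

```python
def run_capture(stages, x):
    """Run reference stages in order, recording each stage's input and output."""
    captured = []
    for name, fn in stages:
        y = fn(x)
        captured.append((name, x, y))
        x = y
    return captured


def check_with_injection(candidate_stages, captured):
    """Feed each candidate stage the captured reference input; report max abs diff."""
    report = {}
    for (name, fn), (ref_name, ref_in, ref_out) in zip(candidate_stages, captured):
        if name != ref_name:
            raise ValueError(f"stage mismatch: {name} vs {ref_name}")
        out = fn(ref_in)
        report[name] = max(abs(a - b) for a, b in zip(out, ref_out))
    return report


# Hypothetical two-stage pipelines; the candidate has a bug in "shift".
reference = [
    ("scale", lambda xs: [v * 2 for v in xs]),
    ("shift", lambda xs: [v + 1 for v in xs]),
]
candidate = [
    ("scale", lambda xs: [v * 2 for v in xs]),
    ("shift", lambda xs: [v + 1.5 for v in xs]),
]

captured = run_capture(reference, [1.0, 2.0])
report = check_with_injection(candidate, captured)
print(report)
```

Because every stage receives the reference input, a nonzero diff points at the stage that produced it and not at something upstream.
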
Result: the diffusers pipeline now produces pixel-identical video (0.0 diff)
and a bit-identical audio waveform compared with the reference pipeline.
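
The two parity claims above correspond to two different checks, sketched here with the standard library only. The function names and sample values are hypothetical; the real tests presumably compare tensors.

```python
import struct


def max_abs_diff(a, b):
    """Largest absolute element-wise difference (0 means pixel-identical)."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    return max((abs(x - y) for x, y in zip(a, b)), default=0)


def bit_identical_f32(a, b):
    """True only if both sequences have the same float32 bit patterns."""
    if len(a) != len(b):
        return False
    return struct.pack(f"<{len(a)}f", *a) == struct.pack(f"<{len(b)}f", *b)


# Hypothetical pixel values and waveform samples.
print(max_abs_diff([0, 128, 255], [0, 128, 255]))
print(bit_identical_f32([0.25, -0.5], [0.25, -0.5]))
# Numerically equal but not bit-identical: 0.0 and -0.0 differ in the sign bit.
print(max_abs_diff([0.0], [-0.0]), bit_identical_f32([0.0], [-0.0]))
```

Bit-identity is the stricter test: a zero numeric diff does not guarantee matching bit patterns.
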
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Author
yiyi@huggingface.co