Fix LTX-2.3 pipeline quality issues and add parity notes

Essential fixes (output quality degrades without them):
- Generate latents in the model dtype (self.transformer.dtype) instead
of float32. Float32 noise differs from bfloat16 noise by a
quantization error of ~1.5e-02, and that error compounds over the 30
denoising steps via 1/sigma amplification, producing washed-out output.
- Use a fixed max_image_seq_len (4096) for the sigma schedule shift
instead of the actual video_sequence_length. The reference uses a fixed
constant (MAX_SHIFT_ANCHOR=4096); passing the real sequence length
(e.g. 6144) produces a sigma schedule that differs from the reference.
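
The two essential fixes can be illustrated without the pipeline itself. The sketch below is plain Python, not the pipeline code. The helper names, the shift defaults (base_seq_len=1024, base_shift=0.95, max_shift=2.05), and the exponential shift formula are assumptions chosen for illustration; only the 4096 anchor and the 6144 example length come from this commit. The bfloat16 helper shows that for magnitudes in [4, 8) the rounding step is 2^-5, so the worst-case error is about 1.5e-02, which is consistent with the figure above (how that figure was measured is not stated here).

```python
import math
import struct


def to_bfloat16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even); finite inputs only."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias for the 16 dropped bits
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def shift_mu(seq_len, base_seq_len=1024, max_seq_len=4096,
             base_shift=0.95, max_shift=2.05):
    """Linearly interpolate the shift parameter mu from a sequence length.

    The default values are illustrative assumptions, not confirmed settings.
    """
    slope = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    return base_shift + slope * (seq_len - base_seq_len)


def shift_sigma(sigma, mu):
    """Exponential time shift for sigma in (0, 1]; larger mu pushes sigma up."""
    return math.exp(mu) / (math.exp(mu) + (1.0 / sigma - 1.0))


MAX_SHIFT_ANCHOR = 4096  # fixed constant named in this commit

mu_fixed = shift_mu(MAX_SHIFT_ANCHOR)  # what the fix passes
mu_actual = shift_mu(6144)             # what was passed before the fix

# The same base sigma maps to different shifted values under the two choices.
print(shift_sigma(0.5, mu_fixed), shift_sigma(0.5, mu_actual))

# Rounding error of a tail-sized noise value when stored as bfloat16.
print(abs(to_bfloat16(4.015) - 4.015))
```

Under these assumed defaults, any sequence length above the anchor extrapolates mu past max_shift, so every sigma in the schedule is shifted further than it would be with the fixed anchor.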

Seed-level parity (does not affect quality, but needed for reproducibility):
- Generate noise directly in the packed [B, S, D] shape to match the
reference, which patchifies before generating noise. Different tensor
shapes produce different RNG draws for the same seed.
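
One way the shape dependence can arise is sketched below using Python's standard random module as a stand-in for the tensor RNG. The function names and the tiny [C, S] layout are hypothetical, and real tensor libraries may differ in detail. The sketch shows that the same seeded stream, written into an unpacked layout and then packed, places values at different positions than the same stream written directly into the packed layout.

```python
import random


def draw(seed, n):
    """Draw n standard-normal samples from a freshly seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def noise_then_pack(seed, channels, tokens):
    """Fill a [C, S] layout in row-major order, then transpose to [S, C]."""
    flat = draw(seed, channels * tokens)
    unpacked = [flat[c * tokens:(c + 1) * tokens] for c in range(channels)]
    return [[unpacked[c][s] for c in range(channels)] for s in range(tokens)]


def noise_in_packed_shape(seed, channels, tokens):
    """Fill the packed [S, C] layout directly in row-major order."""
    flat = draw(seed, tokens * channels)
    return [flat[s * channels:(s + 1) * channels] for s in range(tokens)]


a = noise_then_pack(seed=0, channels=2, tokens=3)
b = noise_in_packed_shape(seed=0, channels=2, tokens=3)

# Same multiset of samples, but assigned to different (token, channel) slots.
print(a == b)
```

Both layouts hold the same values overall, so the statistics are unchanged; only the per-element assignment differs, which is why this affects seed parity and not quality.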

Notes only (no behavioral change):
- Add NOTE about the rounding difference between x0-space and
velocity-space guidance
- Add NOTE about the denormalize-after-noise ordering in the reference
- Add commented-out code showing where denormalize would move to match
the reference ordering (for future investigation)
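
The first note can be made concrete with a scalar sketch. It assumes the usual flow-matching convention x0 = x_t - sigma * v and standard classifier-free guidance; the function names are hypothetical and this is not the pipeline code. The two forms are algebraically identical, so any difference between them is rounding only, and the x0 form ends with a division by sigma.

```python
def guide_velocity(v_uncond, v_cond, scale):
    """Apply guidance directly to the velocity predictions."""
    return v_uncond + scale * (v_cond - v_uncond)


def guide_x0(x_t, sigma, v_uncond, v_cond, scale):
    """Apply guidance to x0 predictions, then convert back to a velocity."""
    # Assumed convention: x0 = x_t - sigma * v.
    x0_uncond = x_t - sigma * v_uncond
    x0_cond = x_t - sigma * v_cond
    x0_guided = x0_uncond + scale * (x0_cond - x0_uncond)
    # Converting back divides by sigma, which scales up rounding error in x0.
    return (x_t - x0_guided) / sigma


v_a = guide_velocity(0.3, -0.2, scale=4.0)
v_b = guide_x0(0.7, 0.05, 0.3, -0.2, scale=4.0)

# Equal up to floating-point rounding; the gap widens at lower precision.
print(v_a, v_b, abs(v_a - v_b))
```

In float64 the gap is negligible; at reduced precision the intermediate x0 values are rounded before the division by sigma, so the two paths need not agree bit for bit.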

Not part of this commit: the upstream dg845/LTX-2.3-Diffusers VAE config
sets upsample_residual=[true,true,true,true], but it should be
[false,...]. A fix has been submitted as PR#1 to that repo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Author
yiyi@huggingface.co