[LTX-2.3] Fix bf16 parity between diffusers and reference implementation
Seven fixes to achieve bit-identical output between the diffusers LTX-2.3
pipeline and the reference Lightricks/LTX-2 implementation in bf16 on GPU:
1. encode_video: use truncation (.astype) instead of .round() for float→uint8,
matching the reference's .to(torch.uint8) behavior
2. Scheduler sigma computation: compute time_shift and stretch_shift_to_terminal
in torch float32 instead of numpy float64 to match reference precision
3. Initial sigmas: use torch.linspace (float32) instead of np.linspace (float64)
to produce bit-identical sigma schedules
4. CFG formula: use the reference formula cond + (scale-1)*(cond-uncond) instead of
uncond + scale*(cond-uncond); the two are algebraically equal but round differently in bf16, so this matches the reference's arithmetic order
5. Euler step: upcast model_output to the sample dtype before multiplying by dt;
under PyTorch's type promotion rules a 0-dim dt tensor does not promote the bf16 model_output, so the product was previously rounded to bf16
6. x0→velocity division: use sigma.item() (a Python float) instead of a 0-dim tensor,
matching the reference's to_velocity, which calls sigma.item() internally
7. RoPE: remove the float32 upcast in apply_interleaved_rotary_emb and
apply_split_rotary_emb and cast cos/sin to the input dtype instead; the reference
computes RoPE in the model dtype (bf16) without upcasting
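A minimal sketch of fix 1, assuming decoded frames are float32 arrays already scaled to [0, 255] (the sample values are made up). NumPy's `.astype(np.uint8)` and PyTorch's `.to(torch.uint8)` both truncate in-range values toward zero, while `.round()` rounds half to even first, so the old path can land one code value higher:

```python
import numpy as np
import torch

# Made-up decoded pixel values, already scaled to [0, 255]
values = [0.2, 0.5, 0.7, 127.5, 254.6, 255.0]
frame = np.array(values, dtype=np.float32)

old = frame.round().astype(np.uint8)  # previous path: round half to even, then cast
new = frame.astype(np.uint8)          # fixed path: the cast alone truncates toward zero
ref = torch.tensor(values, dtype=torch.float32).to(torch.uint8).numpy()  # reference-style cast

print(old.tolist())        # [0, 0, 1, 128, 255, 255]
print(new.tolist())        # [0, 0, 0, 127, 254, 255]
print((new == ref).all())  # True
```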
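Fix 2 can be sketched as follows. The two helpers assume the exponential time-shift and terminal-stretch formulas used by diffusers' flow-matching schedulers; `mu`, `shift_terminal` and the step count are illustrative values, not LTX-2.3's actual config. Because the helpers only use arithmetic operators, running them on float32 tensors keeps every intermediate in float32, whereas the old path evaluated them in NumPy float64 and rounded only at the end:

```python
import math

import numpy as np
import torch


def time_shift(mu, sigma, t):
    # Exponential time shift; every intermediate stays in t's dtype
    return math.exp(mu) / (math.exp(mu) + (1 / t - 1) ** sigma)


def stretch_shift_to_terminal(t, shift_terminal):
    # Rescale (1 - t) so that the final value lands on shift_terminal
    one_minus_z = 1 - t
    scale_factor = one_minus_z[-1] / (1 - shift_terminal)
    return 1 - one_minus_z / scale_factor


mu, shift_terminal, steps = 2.05, 0.1, 30  # illustrative values
t = torch.linspace(1.0, 1.0 / steps, steps)  # float32, avoids t = 0 in 1 / t

# Fixed path: torch float32 end to end
sig32 = stretch_shift_to_terminal(time_shift(mu, 1.0, t), shift_terminal)

# Old path: NumPy float64, cast to float32 only at the end
t64 = t.numpy().astype(np.float64)
sig64 = torch.from_numpy(
    stretch_shift_to_terminal(time_shift(mu, 1.0, t64), shift_terminal)
).float()

print(sig32.dtype, (sig32 != sig64).sum().item(), "entries differ")
```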
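For fix 3, the only change is where the linear sigma grid is generated. A sketch with an illustrative step count; how many entries disagree depends on the step count, so the example reports the count instead of asserting a particular result:

```python
import numpy as np
import torch

num_steps = 30  # illustrative

# Old: grid generated in float64 by NumPy, then cast to float32
sigmas_np = torch.from_numpy(np.linspace(1.0, 0.0, num_steps + 1)).to(torch.float32)

# New: grid generated directly in float32 with torch.linspace
sigmas_torch = torch.linspace(1.0, 0.0, num_steps + 1, dtype=torch.float32)

mismatch = (sigmas_np != sigmas_torch).nonzero().flatten().tolist()
print(f"{len(mismatch)} of {num_steps + 1} sigmas differ: {mismatch}")
```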
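Fix 4 relies on the two CFG expressions being equal in exact arithmetic but not in bf16. With the illustrative `scale = 3.0`, `(scale - 1) * diff` is an exact doubling in binary floating point, while `scale * diff` needs an extra rounding. A sketch with random tensors:

```python
import torch

torch.manual_seed(0)
cond = torch.randn(4096, dtype=torch.bfloat16)
uncond = torch.randn(4096, dtype=torch.bfloat16)
scale = 3.0  # illustrative guidance scale

diff = cond - uncond             # rounded to bf16
old = uncond + scale * diff      # previous diffusers formula
new = cond + (scale - 1) * diff  # reference formula

# float64 value of the same expression, for comparison only
exact = uncond.double() + scale * (cond.double() - uncond.double())
print("bf16 mismatches:", (old != new).sum().item())
print("max error old:", (old.double() - exact).abs().max().item())
print("max error new:", (new.double() - exact).abs().max().item())
```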
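The promotion issue behind fix 5 is easy to reproduce. In PyTorch, a 0-dim tensor does not change the result dtype of an operation with a dimensioned tensor of the same category, so a bf16 `model_output` times a float32 0-dim `dt` comes back as bf16. The shapes and sigma values below are illustrative:

```python
import torch

torch.manual_seed(0)
sample = torch.randn(1, 16, dtype=torch.float32)            # latents kept in float32
model_output = torch.randn(1, 16, dtype=torch.bfloat16)     # transformer output in bf16
sigma, sigma_next = torch.tensor(0.75), torch.tensor(0.70)  # 0-dim float32
dt = sigma_next - sigma

# Old: bf16 * 0-dim float32 -> bf16, so the update is rounded to bf16
step_old = model_output * dt
# New: upcast first, so the multiplication and its result are float32
step_new = model_output.to(sample.dtype) * dt

prev_old = sample + step_old
prev_new = sample + step_new
print(step_old.dtype, step_new.dtype)  # torch.bfloat16 torch.float32
```

The final addition is float32 in both cases; the difference is only whether the update was already rounded to bf16 before being added.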
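Fix 6 can be sketched with a hypothetical `to_velocity` helper (not the reference's actual code) that computes the flow-matching velocity `v = (x_t - x0) / sigma` from a denoised estimate. Passing `sigma.item()` turns the divisor into a Python float. Whether that changes the rounded bf16 result compared with a 0-dim tensor divisor may depend on the kernel PyTorch selects on a given device, so the example only checks dtypes and closeness:

```python
import torch


def to_velocity(sample, denoised, sigma: float):
    # Hypothetical helper: v = (x_t - x0) / sigma, with sigma as a Python float
    return (sample - denoised) / sigma


torch.manual_seed(0)
sample = torch.randn(2, 8, dtype=torch.bfloat16)    # x_t
denoised = torch.randn(2, 8, dtype=torch.bfloat16)  # x0 prediction
sigma = torch.tensor(0.75)                          # 0-dim float32 from the schedule

v_old = (sample - denoised) / sigma                  # old: divide by a 0-dim tensor
v_new = to_velocity(sample, denoised, sigma.item())  # new: divide by a Python float
print(v_old.dtype, v_new.dtype)  # torch.bfloat16 torch.bfloat16
```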
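For fix 7, a minimal sketch of interleaved RoPE that stays in the input dtype, assuming `cos`/`sin` are float32 tables with each frequency repeated for its (even, odd) channel pair. `rope_interleaved` is a hypothetical name illustrating the idea, not the diffusers function body, and the split variant is not shown. Calling it on an upcast input reproduces the old float32 path for comparison:

```python
import torch


def rope_interleaved(x, cos, sin):
    # Rotate adjacent channel pairs (x[2i], x[2i+1]) by per-position angles.
    # cos/sin are cast to x's dtype, so a bf16 input stays in bf16 (no float32 upcast).
    cos, sin = cos.to(x.dtype), sin.to(x.dtype)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    x_rot = torch.stack((-x_odd, x_even), dim=-1).flatten(-2)  # (-x1, x0, -x3, x2, ...)
    return x * cos + x_rot * sin


seq_len, dim = 6, 8  # illustrative sizes
pos = torch.arange(seq_len, dtype=torch.float32)[:, None]
inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
angles = (pos * inv_freq).repeat_interleave(2, dim=-1)  # (seq_len, dim), one angle per pair
cos, sin = angles.cos(), angles.sin()

torch.manual_seed(0)
x = torch.randn(seq_len, dim, dtype=torch.bfloat16)
out_bf16 = rope_interleaved(x, cos, sin)                      # new: computed in bf16
out_fp32 = rope_interleaved(x.float(), cos, sin).to(x.dtype)  # old: float32, then cast back
print("mismatching elements:", (out_bf16 != out_fp32).sum().item())
```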
Also updates RMSNorm to use torch.nn.functional.rms_norm for consistency.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Author: yiyi@huggingface.co