Eliminate GPU sync overhead and CPU→GPU transfers across LTX2 pipeline (#13564)
* Remove unnecessary CUDA synchronization points and avoid CPU→GPU tensor creation
across the LTX2 pipeline, transformer, scheduler, and connector logic.
- Add `set_begin_index(0)` to schedulers to eliminate the device-to-host (DtoH) sync in `_init_step_index`
- Replace `torch.tensor(..., device=...)` with on-device tensor construction for decode scaling
- Move RoPE-related tensor creation to the GPU to avoid host-to-device memcpy overhead
- Refactor connector padding logic to use vectorized masking instead of list-based ops
* Apply style fixes
* Revert low-impact CUDA synchronization changes and remove redundant `hasattr` check
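
For context on the set_begin_index(0) item above, here is a minimal sketch of why presetting the begin index can skip a sync. `ToyScheduler` is a hypothetical stand-in, not the actual diffusers scheduler class, and its lookup-by-value branch is an assumption about how such schedulers typically resolve the step index:

```python
import torch


class ToyScheduler:
    """Hypothetical scheduler used only to illustrate the idea."""

    def __init__(self, timesteps: torch.Tensor):
        self.timesteps = timesteps
        self._step_index = None
        self._begin_index = None

    def set_begin_index(self, begin_index: int = 0) -> None:
        # Store a plain Python int so no tensor lookup is needed later.
        self._begin_index = begin_index

    def _init_step_index(self, timestep: torch.Tensor) -> None:
        if self._begin_index is None:
            # Lookup by value: nonzero() and .item() both make the host
            # wait for the device when `timesteps` lives on a GPU.
            matches = (self.timesteps == timestep).nonzero()
            self._step_index = matches[0].item()
        else:
            # Fast path: reuse the preset index, no device read-back.
            self._step_index = self._begin_index


device = "cuda" if torch.cuda.is_available() else "cpu"
scheduler = ToyScheduler(torch.linspace(1000, 0, 5, device=device))
scheduler.set_begin_index(0)
scheduler._init_step_index(scheduler.timesteps[0])
```

Both branches resolve to the same index for the first timestep; the preset path just avoids reading a value back from the device.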
---------
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>