[LT] Allow lazy_model.mark_step to specify a device (#72683)
Summary:
Currently this API only synchronizes tensors on the default device. This
is problematic in a distributed environment, where there are multiple devices.
This PR adds a parameter to the API so that the caller can specify which
device's tensors to synchronize. Please refer to the attached cuda1.py for
detailed examples. This aligns with how DDP works: a rank (device index) is
often passed from torch.multiprocessing, and the model must first be moved to
that rank's device before training/inference.
An alternative would be to add an API that lets users set the default device,
but that seems too verbose.
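
For illustration, here is a minimal sketch (not the attached cuda1.py) of how the new parameter might be used in a per-rank worker. The full `lazy_model` import path and the device-string format are assumptions for illustration; the only thing this PR adds is the device argument to `mark_step`.

```python
import torch
import torch.multiprocessing as mp
import lazy_tensor_core.core.lazy_model as ltm  # assumed module exposing mark_step

def worker(rank: int, world_size: int) -> None:
    device = f"cuda:{rank}"                   # assumed device-string format
    model = torch.nn.Linear(8, 8).to(device)  # move the model to this rank's device,
    x = torch.randn(4, 8, device=device)      # as DDP workers typically do
    y = model(x).sum()
    # Before this PR, mark_step() only flushed pending computation on the default
    # device; passing a device lets each rank synchronize its own tensors.
    ltm.mark_step(device)

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```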
Test Plan:
Run the attached cuda1.py and observe logs indicating that tensors are
synchronized on device Unknown1 (we should fix the log to show CUDA1).