Use non-blocking copy for creation of lazy tensors in TS backend impl (#69397)
- Blocking here makes the CPU tracing thread wait for the transfer and the
  CUDA sync before continuing to trace more ops; we want to overlap CPU
  tracing with running the copy in the background.
- Blocking is not required if the tensor is already on the CUDA device, but
  it is required if the tensor is on the CPU device, since the CPU thread
  could modify the tensor while it is being copied asynchronously.
- We make an exception for numel()==1 tensors: a non-blocking .to() from CPU
  to CUDA is potentially dangerous even for single-element tensors, but
  fill_ on a CUDA tensor is an async operation, and .item() on a
  single-element CPU tensor is fast, so scalars can take the item()/fill_
  path instead of a blocking copy.
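The decision rules above can be sketched as a small policy function. This is
a hypothetical illustration, not the actual TS backend code; the helper name
`should_block_copy` and its string-based device argument are assumptions for
the sketch.

```python
def should_block_copy(src_device: str, numel: int) -> bool:
    """Hypothetical sketch of the blocking policy described above.

    Returns True when the copy into the lazy tensor must block the
    CPU tracing thread, False when it can run in the background.
    """
    if src_device == "cuda":
        # Source already lives on the device; no host-side mutation
        # race, so the copy can be non-blocking.
        return False
    if numel == 1:
        # Single-element CPU tensors take the .item()/fill_ path:
        # .item() is cheap on the host, and fill_ is async on CUDA,
        # so no blocking wait is needed.
        return False
    # A multi-element CPU source could be mutated by the CPU thread
    # while the async copy is in flight; block to stay safe.
    return True
```

Usage: the tracing thread would consult this before issuing the copy, e.g.
`should_block_copy("cpu", t.numel())`.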