[dtensor] add async_op option to redistribute and some refactor (#121477)
async output option was only available in `full_tensor()` call, but I think it's
generally good to make this option available in the `redistribute` call directly
so that user can control it
This PR adds async_op option to redistribute call, to allow user control
whether to perform tensor redistribution asynchronously or not.
By default we set this to False, this is to follow the semantics of the c10d
collectives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121477
Approved by: https://github.com/wz337