Fix distributed documentation for asynchronous collective Work objects (#45709)
Summary:
Closes https://github.com/pytorch/pytorch/issues/42247. Clarifies the documentation of `Work` object semantics (`Work` objects are the outputs of async collective functions). Clarifies the difference between CPU operations and CUDA operations (on the Gloo or NCCL backend), and adds an example where understanding the `wait()` semantics of CUDA operations is necessary to write correct code.
![sync](https://user-images.githubusercontent.com/8039770/94875710-6f64e780-040a-11eb-8fb5-e94fd53534e5.png)
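For reference, a minimal sketch of the pattern the clarified docs are concerned with. It is not copied from the diff. The helper name `all_reduce_then_add` is hypothetical, and the sketch assumes one GPU per rank on the CUDA path and `env://` rendezvous (`MASTER_ADDR` and `MASTER_PORT` set by the launcher). On the CPU path, `wait()` blocks until the collective has finished. On the CUDA path, `wait()` only makes the result safe to use on the current stream, so a side stream has to wait on that stream explicitly before it touches the output.

```python
import torch
import torch.distributed as dist


def all_reduce_then_add(rank: int, world_size: int, use_cuda: bool) -> torch.Tensor:
    """Hypothetical helper: run an async all_reduce, then consume its result."""
    # Assumption: MASTER_ADDR / MASTER_PORT are already set in the environment.
    backend = "nccl" if use_cuda else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    try:
        if use_cuda:
            # Assumption: rank i owns GPU i.
            device = torch.device("cuda", rank)
            torch.cuda.set_device(device)
        else:
            device = torch.device("cpu")

        output = torch.tensor([rank + 1], device=device)
        work = dist.all_reduce(output, async_op=True)

        # CPU: blocks until the collective has completed.
        # CUDA: returns once the result can be used on the current stream;
        # the collective itself may still be running on the device.
        work.wait()

        if use_cuda:
            side = torch.cuda.Stream(device=device)
            with torch.cuda.stream(side):
                # Without this, add_ on the side stream could race with the
                # all_reduce that was ordered against the default stream.
                side.wait_stream(torch.cuda.default_stream(device))
                output.add_(100)
            # Make sure the side-stream work is finished before copying back.
            side.synchronize()
        else:
            output.add_(100)

        return output.cpu()
    finally:
        dist.destroy_process_group()
```

Each rank would call this with its own `rank`, for example from a script started by a launcher that sets the rendezvous variables. Dropping the `wait_stream` call on the CUDA path is the kind of mistake the updated docs are meant to prevent.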
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45709
Reviewed By: ngimel
Differential Revision: D24171256
Pulled By: rohan-varma
fbshipit-source-id: 6365a569ef477b59eb2ac0a8a9a1c1f34eb60e22