Avoid 2 extra copies when reducing sparse tensors and fix result() vs inplace output discrepancy (#57822)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57822
* `AsyncSparseAllreduceWork` can avoid copying output tensors, since all results are kept alive by modifying the input vector directly
* `AsyncSparseAllreduceWork` now returns the inputs back to the user instead of the former behavior of returning copies of the inputs. This is consistent with other operations and process group implementations
* `AsyncSparseAllreduceCUDAWork` now copies tensors directly from CPU into the input tensors, avoiding the extra copy chain `output` -> `outputs` -> `inputs`. The inputs are returned back to the user. This is consistent with other operations and process group implementations.

Overall, `AsyncSparseAllreduceCUDAWork` now avoids 2 extra copies (since `AsyncSparseAllreduceCUDAWork` reuses `AsyncSparseAllreduceWork`'s implementation)
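As a rough sketch of the pattern (plain Python, not the actual C++ implementation; the function names and list-of-lists "tensors" are hypothetical stand-ins), the change amounts to writing reduced results directly into the caller's input container and returning that same container, so `result()` and the in-place outputs refer to the same objects:

```python
# Hypothetical sketch of the copy-avoidance pattern described above.
# Lists of lists stand in for sparse tensors.

def allreduce_copying(inputs, reduced):
    # Old behavior: materialize fresh copies of the reduced tensors,
    # then copy them again into the user-visible result.
    outputs = [list(t) for t in reduced]   # extra copy 1: reduced -> outputs
    results = [list(t) for t in outputs]   # extra copy 2: outputs -> results
    return results                         # result() differs from inputs

def allreduce_inplace(inputs, reduced):
    # New behavior: write the reduced values directly into the caller's
    # input vector; the inputs keep the results alive, no intermediates.
    for i, t in enumerate(reduced):
        inputs[i] = t
    return inputs                          # result() is the inputs themselves

inputs = [[1, 2], [3, 4]]
reduced = [[2, 4], [6, 8]]
out = allreduce_inplace(inputs, reduced)
assert out is inputs                       # no result()/in-place discrepancy
```

This also illustrates the `result()` vs in-place discrepancy in the title: with the copying variant, the returned tensors are distinct objects from the inputs the user passed in.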
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D28298325
Pulled By: agolynski
fbshipit-source-id: 18e2104413cdf5e73a01aad464e2613807779297