Make output end up on all GPUs at the end (#2423)
* Make output end up on the CPU at the end
* Rework a bit
* Remove the CPU part
* Update to include a new util to copy tensors across devices
* Update test
* Update doc
* Update docstring
* Default to False; change it if community feedback asks for it
* Apply suggestions from code review
* Update default to False in doc and make a tip
* Update typing
* Defaults
* Explain
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>