[4/N] [Dispatchable Collectives] Update all_reduce_ with CPU / CUDA implementations (#83810)
### About this PR
* Update the all_reduce op to dispatch to separate CPU and CUDA implementations. Both currently perform the same logic, so this change is effectively a no-op.
* Update the test to validate that device types without a registered implementation are not supported.
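As an illustrative sketch only (the names below are hypothetical, not the actual c10d dispatcher API), the per-device dispatch described above amounts to a registry keyed by device type, where the CPU and CUDA entries currently share one implementation and any other device is rejected:

```python
# Hypothetical sketch of per-device dispatch; not the real PyTorch
# dispatcher registration code.

def _all_reduce_common(tensors):
    # Placeholder for the shared logic both device kernels run today.
    return tensors

# Registry mapping a device key to its all_reduce_ implementation.
# CPU and CUDA currently point at the same function (hence "no-op").
_ALL_REDUCE_IMPLS = {
    "cpu": _all_reduce_common,
    "cuda": _all_reduce_common,
}

def all_reduce_(tensors, device_type):
    impl = _ALL_REDUCE_IMPLS.get(device_type)
    if impl is None:
        # Mirrors the updated test: devices without a registered
        # implementation are not supported.
        raise NotImplementedError(
            f"all_reduce_ has no implementation for device '{device_type}'"
        )
    return impl(tensors)
```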
### About this stack
In the future, ProcessGroup will be repurposed to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and dispatch to them based on tensor device type. The CPU and CUDA implementations will then be updated so that the process group selects its CPU and CUDA backends, respectively.
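A minimal sketch of that planned direction, using hypothetical class and method names (the real Backend/ProcessGroup interfaces may differ): a ProcessGroup holds one backend per device type and routes each collective to the backend matching the input tensors' device.

```python
# Hypothetical sketch of the planned design; names are illustrative,
# not the actual torch.distributed interfaces.

class Backend:
    """Stand-in for a backend such as ProcessGroupNCCL/Gloo/UCC."""
    def __init__(self, name):
        self.name = name

    def all_reduce(self, tensors):
        # A real backend would run the collective; here we just tag it.
        return f"{self.name}:all_reduce"

class ProcessGroup:
    """Holds per-device backends and dispatches on device type."""
    def __init__(self):
        self._backends = {}

    def register_backend(self, device_type, backend):
        self._backends[device_type] = backend

    def all_reduce(self, tensors, device_type):
        backend = self._backends.get(device_type)
        if backend is None:
            raise RuntimeError(f"no backend registered for '{device_type}'")
        return backend.all_reduce(tensors)

# CPU collectives go to a Gloo-like backend, CUDA to an NCCL-like one.
pg = ProcessGroup()
pg.register_backend("cpu", Backend("gloo"))
pg.register_backend("cuda", Backend("nccl"))
```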
Differential Revision: [D39506979](https://our.internmc.facebook.com/intern/diff/D39506979)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83810
Approved by: https://github.com/kwen2501