[15/N] Add allreduce_coalesced custom op with CPU/CUDA implementations (#88846)
Differential Revision: [D41227740](https://our.internmc.facebook.com/intern/diff/D41227740)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88846
Approved by: https://github.com/kwen2501