[optim] Improve adadelta foreach, group tensors to maximize fast path (#92048)
Previously, adadelta foreach sent tensors to the slow path unless they all shared the same dtype and were all on the same device.
This PR groups tensors by device and dtype in the adadelta optimizer so that foreach runs in batches, allowing more users to benefit from foreach perf.
New tests cover the grouped path to ensure this behavior is not broken.
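For illustration only, the grouping idea can be sketched roughly as below. This is not the code from this PR: `group_by_device_and_dtype` and `FakeTensor` are hypothetical names, and the stand-in class exists only so the snippet runs without torch. The helper reads only `.device` and `.dtype`, so the same idea would apply to real tensors.

```python
from collections import defaultdict
from typing import NamedTuple


class FakeTensor(NamedTuple):
    """Hypothetical stand-in for a tensor; carries only what the sketch needs."""
    name: str
    device: str
    dtype: str


def group_by_device_and_dtype(tensorlists):
    """Split parallel lists (params, grads, ...) into (device, dtype) buckets.

    The first list decides the bucket; element i of every other list
    follows element i of the first list into the same bucket.
    """
    grouped = defaultdict(lambda: [[] for _ in tensorlists])
    for i, ref in enumerate(tensorlists[0]):
        key = (str(ref.device), str(ref.dtype))
        for j, tensors in enumerate(tensorlists):
            grouped[key][j].append(tensors[i])
    return dict(grouped)


params = [
    FakeTensor("p0", "cpu", "float32"),
    FakeTensor("p1", "cuda:0", "float16"),
    FakeTensor("p2", "cpu", "float32"),
]
grads = [
    FakeTensor("g0", "cpu", "float32"),
    FakeTensor("g1", "cuda:0", "float16"),
    FakeTensor("g2", "cpu", "float32"),
]

groups = group_by_device_and_dtype([params, grads])
for (device, dtype), (ps, gs) in groups.items():
    # Each bucket is homogeneous, so a single batched (foreach-style)
    # update could process it instead of falling back to the slow path.
    print(device, dtype, [p.name for p in ps], [g.name for g in gs])
```

With mixed inputs like these, the optimizer step would be called once per bucket rather than once per tensor.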
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92048
Approved by: https://github.com/albanD