Add ASGD capturable API for forloop (#121264)
@tfsingh I got to it first--wanted to land this stack and close the gap ASAP.
This PR also fixes a discrepancy between `_init_group` and `__setstate__`, since the constants now always live on the params' device.
There is a next step, though:
- ASGD can be made faster by keeping etas, mus, and steps on CPU when NOT capturable. (I had mistakenly thought foreachifying was faster, so we landed https://github.com/pytorch/pytorch/pull/107857, but it is slower.) No one has complained yet though. ¯\_(ツ)_/¯
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121264
Approved by: https://github.com/albanD
ghstack dependencies: #121260