Implement NAdamW optimizer (#103881)
NAdamW, which is simply NAdam with the AdamW (decoupled) weight decay term, has shown strong performance in optimizer comparisons such as:
1. https://arxiv.org/abs/2211.09760
1. https://arxiv.org/abs/2306.07179
[The VeLO paper](https://arxiv.org/abs/2211.09760) argues that NAdamW's power lies in its ability to act as a superset of other popular optimizers.
This PR adds NAdamW by ~~copying and making very small adaptations to the NAdam implementation (just like AdamW and Adam). To see the small changes in better detail, you can `diff torch/optim/nadam.py torch/optim/nadamw.py`.~~ adding a boolean flag `decoupled_weight_decay` to NAdam. The flag is `False` by default; setting it to `True` activates NAdamW behavior.
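
To illustrate what the flag switches between, here is a minimal scalar sketch of one NAdam step in plain Python. It is a simplified illustration and not the code added in this PR: the helper names `nadam_step` and `new_state` are hypothetical, and the momentum schedule and default values are assumed to follow the usual NAdam formulation. The only branch that differs between NAdam and NAdamW is how `weight_decay` is applied.

```python
import math


def new_state():
    # Hypothetical helper: fresh optimizer state for one scalar parameter.
    return {"step": 0, "mu_product": 1.0, "exp_avg": 0.0, "exp_avg_sq": 0.0}


def nadam_step(param, grad, state, lr=2e-3, betas=(0.9, 0.999), eps=1e-8,
               weight_decay=0.0, momentum_decay=4e-3,
               decoupled_weight_decay=False):
    """One scalar NAdam step (illustrative sketch, not the PyTorch source)."""
    beta1, beta2 = betas
    state["step"] += 1
    step = state["step"]

    if weight_decay != 0.0:
        if decoupled_weight_decay:
            # NAdamW: shrink the parameter directly, as AdamW does.
            param *= 1.0 - lr * weight_decay
        else:
            # Plain NAdam: fold the L2 penalty into the gradient.
            grad += weight_decay * param

    # Nesterov momentum schedule (assumed standard NAdam form).
    mu = beta1 * (1.0 - 0.5 * 0.96 ** (step * momentum_decay))
    mu_next = beta1 * (1.0 - 0.5 * 0.96 ** ((step + 1) * momentum_decay))
    state["mu_product"] *= mu
    mu_product = state["mu_product"]

    # Exponential moving averages of the gradient and its square.
    state["exp_avg"] = beta1 * state["exp_avg"] + (1.0 - beta1) * grad
    state["exp_avg_sq"] = beta2 * state["exp_avg_sq"] + (1.0 - beta2) * grad * grad
    denom = math.sqrt(state["exp_avg_sq"] / (1.0 - beta2 ** step)) + eps

    # Nesterov-style update: current-gradient term plus look-ahead momentum term.
    param -= lr * (1.0 - mu) / (1.0 - mu_product) * grad / denom
    param -= lr * mu_next / (1.0 - mu_product * mu_next) * state["exp_avg"] / denom
    return param


# With a zero gradient, decoupled decay only rescales the parameter,
# while coupled decay feeds the penalty through the adaptive update.
p_decoupled = nadam_step(1.0, 0.0, new_state(), lr=0.1, weight_decay=0.5,
                         decoupled_weight_decay=True)
p_coupled = nadam_step(1.0, 0.0, new_state(), lr=0.1, weight_decay=0.5,
                       decoupled_weight_decay=False)
```

In the sketch, the decoupled branch leaves the gradient statistics untouched, so the decay strength is not rescaled by the adaptive denominator. With this PR, the same choice is exposed on `torch.optim.NAdam` through the `decoupled_weight_decay` argument.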
Interest in the optimizer has also been shown in the PyTorch forums:
https://discuss.pytorch.org/t/nadamw-and-demon-optimizers/179778
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103881
Approved by: https://github.com/janeyx99