[PT-D][Composability][1/N] Upstream NamedOptimizer from TorchRec (KeyedOptimizer in TR) (#89480)
In PyTorch, the optimizer `state_dict` always uses numeric indices to key the per-parameter state.
The composability workstream now needs an FQN-based way to index the optimizer `state_dict` for parameters.
For example, an SGD optimizer might have something like this in its `state_dict`:
```
{'state':
    {0:
        {'momentum_buffer': tensor(...)},
     1:
        {'momentum_buffer': tensor(...)},
     ...
    },
 'param_groups':
    [{'lr': 0.001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': [0, 1, 2, 3, 4, 5, 6, 7]}]
}
```
With `NamedOptimizer`, we want the `state_dict` to look like:
```
{'state':
    {'net1.0.weight':
        {'momentum_buffer': tensor(...)},
     'net1.0.bias':
        {'momentum_buffer': tensor(...)},
     ...
    },
 'param_groups':
    [{'lr': 0.001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': ['net1.0.weight', 'net1.0.bias', 'net2.0.weight', 'net2.0.bias', 'net3.weight', 'net3.bias', 'net4.1.weight', 'net4.1.bias']}]
}
```
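As a rough illustration of the mapping between the two layouts (not the `NamedOptimizer` implementation itself): the numeric keys line up with the order of `model.named_parameters()`, so a stock SGD `state_dict` can be re-keyed by FQN as in the sketch below. The model and optimizer here are made up for the example.
```
# Minimal sketch: re-key a stock optimizer state_dict from parameter indices
# to FQNs. Assumes the optimizer was built from model.parameters(), so index i
# corresponds to the i-th entry of model.named_parameters().
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
optim = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model(torch.randn(8, 4)).sum().backward()
optim.step()  # populates the momentum buffers

fqns = [name for name, _ in model.named_parameters()]
sd = optim.state_dict()

# The 'state' keys and each group's 'params' list are indices; swap in FQNs.
named_sd = {
    'state': {fqns[idx]: s for idx, s in sd['state'].items()},
    'param_groups': [
        {**g, 'params': [fqns[idx] for idx in g['params']]}
        for g in sd['param_groups']
    ],
}
print(list(named_sd['state'].keys()))  # ['0.weight', '0.bias', '1.weight', '1.bias']
```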
We also want to support `load_state_dict` to enable optimizer `state_dict` overrides for `NamedOptimizer`.
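Continuing the variables from the sketch above, an FQN-keyed override boils down to mapping the names back to indices before calling the stock `load_state_dict`; `NamedOptimizer` is meant to accept the FQN-keyed form directly.
```
# Reverse direction (continuing the previous sketch): map FQN keys back to
# indices and load into a stock optimizer.
index_of = {name: idx for idx, name in enumerate(fqns)}
plain_sd = {
    'state': {index_of[n]: s for n, s in named_sd['state'].items()},
    'param_groups': [
        {**g, 'params': [index_of[n] for n in g['params']]}
        for g in named_sd['param_groups']
    ],
}
optim.load_state_dict(plain_sd)
```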
In the next couple of PRs/diffs, we also need to:
1. Make `NamedOptimizer` work with FSDP (e.g., registering a hook for a model wrapped with FSDP) and other PTD/PT components.
2. Make `NamedOptimizer` work well with `apply_optim_in_backward`.
3. Upstream `CombinedOptimizer` as well.
Differential Revision: [D41432088](https://our.internmc.facebook.com/intern/diff/D41432088/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41432088/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89480
Approved by: https://github.com/rohan-varma