Only make a shallow copy when loading optimizer state_dict (#106082)
We still deep copy the param_groups, which are much lighter weight. This should also save memory when loading from a checkpoint.
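The memory difference can be sketched with plain Python objects standing in for tensors (the dict layout below mirrors an optimizer state_dict, but the values are illustrative, not real PyTorch objects): a shallow copy of the per-parameter state shares the underlying buffers, while param_groups remain deep-copied.

```python
import copy

# Hypothetical optimizer state_dict: large per-parameter state plus
# lightweight param_groups. The list stands in for a momentum tensor.
state_dict = {
    "state": {0: {"momentum_buffer": [1.0, 2.0, 3.0]}},
    "param_groups": [{"lr": 0.01, "params": [0]}],
}

# Previous behavior: deepcopy duplicates the (potentially huge) state.
deep = copy.deepcopy(state_dict)
assert deep["state"][0]["momentum_buffer"] \
    is not state_dict["state"][0]["momentum_buffer"]

# New behavior: shallow copy of state shares the buffers, so no extra
# memory is held; only param_groups are still deep-copied.
shallow_state = copy.copy(state_dict["state"])
groups = copy.deepcopy(state_dict["param_groups"])
assert shallow_state[0]["momentum_buffer"] \
    is state_dict["state"][0]["momentum_buffer"]
assert groups[0] is not state_dict["param_groups"][0]
```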
The deepcopy was introduced in https://github.com/pytorch/pytorch/commit/ecfcf39f302f7bb193884f72ef2bd59141e5c46c, but module.py only did a shallow copy at that point, so the deepcopy did not actually bring parity.
Incorporates an XLA fix, which is why I'm updating the pin to https://github.com/pytorch/xla/commit/ca5eab87a71f80cd3168630511d02549cc7d2516.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106082
Approved by: https://github.com/albanD, https://github.com/Skylion007