Deal with weight tying in transformers >=5 (#2922)
While we already implemented forward compatibility with the way transformers>=5
handles weight tying, there was still an issue with weight tying when trainable
tokens wrappers are involved.
Before, we simply got fixed strings naming the modules that are tied to the embeddings,
e.g. `"lm_head"`; this never changed, since it was just a static property of the
respective `PreTrainedModel` class. However, with the new way `get_tied_weights_keys`
is implemented, the names of the tied-to-embeddings modules change if they are
moved around. So if we wrap the `lm_head` once in a trainable tokens wrapper, it'll
become `lm_head.token_adapter.base_layer` instead of `lm_head`. That means the
check to see whether we already wrapped the tied layer needs to look at the
grandparent module instead of the target layer itself.
This obviously assumes that we always have a nesting level of two, which is true
for `TrainableTokensWrapper`.
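For illustration, a minimal sketch of such a grandparent check could look like the
following; the helper name `is_already_wrapped` and the `wrapper_cls` argument are
placeholders here, not the actual PEFT code:

```python
import torch.nn as nn

def is_already_wrapped(model: nn.Module, tied_key: str, wrapper_cls: type) -> bool:
    """Check whether the module behind a tied-weights key is already wrapped.

    An unwrapped head is reported as e.g. "lm_head"; once it is wrapped, the key
    becomes "lm_head.token_adapter.base_layer", so the wrapper sits exactly two
    levels above the reported module (the fixed nesting level mentioned above).
    """
    parts = tied_key.split(".")
    if len(parts) < 3:
        # No nesting yet: the key still points at the original, unwrapped layer.
        return False
    grandparent_name = ".".join(parts[:-2])
    # get_submodule is a standard torch.nn.Module method.
    return isinstance(model.get_submodule(grandparent_name), wrapper_cls)
```

In PEFT terms, `wrapper_cls` would correspond to `TrainableTokensWrapper`, and stripping
exactly two name components is what ties this check to the assumed nesting level of two.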