Implement rotary embeddings (#7)
* Integrate EleutherAI's version of rotary embeddings and make some small optimisations (see the sketch after this list)
* Add argument parser for position embeddings (argparser sketch after this list)
* Make max-absolute-position-embeddings optional
* Move enum outside model
* Handle max_seq_len_cached better
* Fix dtype issue in rotary embeddings
* Fix tensor size
* Replace hidden_dim with hidden_size_per_attention_head
* Change all examples to new format and improve help in argparser
* Revert changes, compare the position-embedding type when checkpointing, and replace args.max_position_embeddings with an upper bound on the sequence sizes
* Revert changes:
  - Rename max-absolute-position-embeddings back to max-position-embeddings
- Make absolute position embeddings the default
* Reformat
* Remove run.sh~ and restore run.sh
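
For context, a minimal sketch of the integrated rotary embedding module, assuming the public EleutherAI/GPT-NeoX formulation (PyTorch; names and shapes are illustrative, not necessarily the exact code merged here). The lazy rebuild of the cos/sin cache and the cast to the activation dtype correspond to the max_seq_len_cached and dtype fixes listed above:

    import torch

    class RotaryEmbedding(torch.nn.Module):
        """Rotary position embedding with a lazily rebuilt cos/sin cache."""

        def __init__(self, dim, base=10000):
            super().__init__()
            inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
            self.register_buffer("inv_freq", inv_freq)
            self.max_seq_len_cached = 0
            self.cos_cached = None
            self.sin_cached = None

        def forward(self, x, seq_len):
            # Rebuild the cache only when a longer sequence or a new dtype shows up.
            if (self.cos_cached is None or seq_len > self.max_seq_len_cached
                    or self.cos_cached.dtype != x.dtype):
                self.max_seq_len_cached = seq_len
                t = torch.arange(seq_len, device=x.device, dtype=self.inv_freq.dtype)
                freqs = torch.einsum("i,j->ij", t, self.inv_freq)
                emb = torch.cat((freqs, freqs), dim=-1)
                # Cast the cached tables to the activation dtype (fp16/bf16 safe).
                self.cos_cached = emb.cos().to(x.dtype)[:, None, None, :]
                self.sin_cached = emb.sin().to(x.dtype)[:, None, None, :]
            return self.cos_cached[:seq_len], self.sin_cached[:seq_len]

    def rotate_half(x):
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat((-x2, x1), dim=-1)

    def apply_rotary_pos_emb(q, k, cos, sin):
        # q, k: [seq_len, batch, heads, hidden_size_per_attention_head]
        return (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)

Caching the tables up to the longest sequence seen avoids recomputing them on every forward pass while still handling variable sequence lengths.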
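
And a rough sketch of the argparser side, assuming a --position-embedding-type flag (the flag, group, and helper names are illustrative). Keeping the enum at module level, outside the model, lets the checkpoint code compare against it too; absolute embeddings stay the default, so --max-position-embeddings is only required in that case:

    import argparse
    import enum

    class PositionEmbeddingType(enum.Enum):
        # Module-level so the model, argparser, and checkpoint code all share it.
        absolute = 1
        rotary = 2

    def add_position_embedding_args(parser):
        group = parser.add_argument_group("position embeddings")
        group.add_argument(
            "--position-embedding-type",
            type=lambda name: PositionEmbeddingType[name],
            choices=list(PositionEmbeddingType),
            default=PositionEmbeddingType.absolute,
            help="Kind of position embedding to use; rotary makes "
                 "--max-position-embeddings optional.",
        )
        return parser

Checkpoint loading can then assert that args.position_embedding_type matches the value stored in the checkpoint before resuming.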
Co-authored-by: Thomas <24695242+thomasw21@users.noreply.github.com>