Fix deepspeed prefix-lm (#107)
* Fix prefix-LM pretraining when using DeepSpeed
* Fix: use `args` instead of `self._args`
* Set `attn_mask` on the model first, then build the model (sketched below)
* Fix: enforce that we pass down a tuple instead of a generator (sketched below)
* Attention mask does not need to be transposed
* Add a temporary hack as a workaround
* Remove the temporary hack
* Skip the prefix test, as PP > 1 does not yet work with DeepSpeed
* Unskip the prefix test
* Merge branch 'main' into thomas/fix_deepspeed_prefix
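
A minimal sketch of the build-order fix, using hypothetical names (`TinyPrefixModel`, `build_causal_mask`, and this use of `args.attn_mask` are illustrative, not the repository's exact API): the point is that an attribute read by the model's constructor has to be populated before the model is built.

```python
# A minimal sketch, assuming the model reads a shared attention mask at
# construction time. All names here are hypothetical, not the repo's API.
import torch
from types import SimpleNamespace

args = SimpleNamespace(seq_length=8, attn_mask=None)

class TinyPrefixModel:
    def __init__(self, args):
        # The constructor captures the mask. Building the model before
        # assigning args.attn_mask would leave this reference as None.
        assert args.attn_mask is not None, "set args.attn_mask before building the model"
        self.attn_mask = args.attn_mask

def build_causal_mask(seq_length):
    # Lower-triangular causal mask; True marks positions that are masked out.
    return torch.tril(torch.ones(1, 1, seq_length, seq_length)) < 0.5

# Correct ordering: set the mask first, then build the model.
args.attn_mask = build_causal_mask(args.seq_length)
model = TinyPrefixModel(args)
```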
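
And a minimal sketch of the tuple-vs-generator fix: a pipeline engine needs each stage's inputs to be an indexable, re-readable sequence, which a generator is not. The helper name below is hypothetical; the pattern is simply materializing the iterable with `tuple(...)` before handing it down.

```python
# A minimal sketch, not the repository's code: wrap lazily produced tensors
# in tuple(...) so the pipeline engine can index and re-read them.
import torch

def make_stage_inputs(tensors):
    """Return pipeline-stage inputs as a tuple, never as a generator."""
    # A generator would be consumed once and cannot be indexed;
    # tuple(...) materializes the iterable exactly once.
    return tuple(t.contiguous() for t in tensors)

if __name__ == "__main__":
    tokens = torch.zeros(2, 8, dtype=torch.long)
    attention_mask = torch.ones(2, 1, 8, 8, dtype=torch.bool)
    inputs = make_stage_inputs(t for t in (tokens, attention_mask))
    assert isinstance(inputs, tuple) and len(inputs) == 2
```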