[PrefixLM] Figuring out why prefix LM is doing poorly on short contexts (#169)
* Loss normalisation should be invariant to the number of tokens trained on, i.e. it should not depend on the loss mask
* Make cross entropy use a microbatch-independent normalisation factor so that every token contributes equally (see the first sketch after this list)
* The loss mask is a float tensor, not a boolean one
* Make the change mergeable, i.e. it does not change the behaviour of GPT
* Allow running with loss_on_targets_only=False for prefix LM (#179) (see the second sketch below)
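
A minimal sketch of the microbatch-independent normalisation described above, in PyTorch. The function name `masked_cross_entropy` and the choice of constant denominator are illustrative assumptions, not the exact implementation in the PR:

```python
import torch
import torch.nn.functional as F

def masked_cross_entropy(logits, targets, loss_mask, normalisation):
    """Cross entropy over masked tokens with a microbatch-independent
    normalisation factor.

    logits:        [batch, seq, vocab]
    targets:       [batch, seq]
    loss_mask:     [batch, seq] float tensor (1.0 where loss is computed)
    normalisation: constant (e.g. batch * seq), so each token's weight
                   does not depend on how many tokens happen to be
                   unmasked in this particular microbatch.
    """
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        targets.view(-1),
        reduction="none",
    )
    # The loss mask is a float tensor, so multiply it in rather than
    # using it as a boolean index.
    masked = per_token * loss_mask.view(-1)
    # Dividing by a constant instead of loss_mask.sum() keeps the loss
    # invariant to the number of tokens trained on in this microbatch.
    return masked.sum() / normalisation
```

With the usual `masked.sum() / loss_mask.sum()`, a microbatch where most tokens are masked up-weights each surviving token; a constant denominator keeps every token's contribution the same across microbatches.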
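
A second sketch of how loss_on_targets_only plausibly shapes the prefix-LM loss mask; `prefix_lm_loss_mask` and its arguments are hypothetical names for illustration:

```python
import torch

def prefix_lm_loss_mask(seq_length, prefix_len, loss_on_targets_only):
    """Build the loss mask for a single prefix-LM sample.

    With loss_on_targets_only=True, loss is computed only on the target
    tokens after the prefix; with False, loss is also computed on the
    prefix tokens. Returned as a float tensor, matching how the mask is
    consumed in the cross-entropy sketch above.
    """
    mask = torch.ones(seq_length)
    if loss_on_targets_only:
        mask[:prefix_len] = 0.0  # no loss on the prefix/input tokens
    return mask

# Example: sequence of 8 tokens with a prefix of 3.
print(prefix_lm_loss_mask(8, 3, loss_on_targets_only=True))
# tensor([0., 0., 0., 1., 1., 1., 1., 1.])
print(prefix_lm_loss_mask(8, 3, loss_on_targets_only=False))
# tensor([1., 1., 1., 1., 1., 1., 1., 1.])
```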