Megatron-DeepSpeed
6d146b5f - [PrefixLM] Figuring out why prefix lm is doing poorly on short context (#169)

[PrefixLM] Figuring out why prefix lm is doing poorly on short context (#169)

* Loss normalisation should be invariant to the number of tokens trained on, i.e. it should not depend on the loss mask
* Make cross entropy use a microbatch-independent normalisation factor so that all tokens matter equally
* Loss mask is not a boolean tensor
* Make it mergeable, i.e. it does not change the behaviour of gpt
* Allow running with loss_on_targets_only=False for prefix lm (#179)
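A minimal sketch of the normalisation idea described above, not the repository's actual implementation: the function name `masked_cross_entropy` and the `normalization` argument are hypothetical, and tensor shapes are assumed for illustration. It contrasts dividing the masked loss by the per-microbatch mask sum (mask-dependent) with dividing by a fixed constant (mask-independent), and treats the loss mask as a float tensor rather than a boolean one.

```python
import torch
import torch.nn.functional as F

def masked_cross_entropy(logits, targets, loss_mask, normalization=None):
    """Illustrative token-level cross entropy with a loss mask.

    logits:    [batch, seq, vocab] (assumed shape for this sketch)
    targets:   [batch, seq] integer token ids
    loss_mask: [batch, seq] float mask (not necessarily boolean)
    normalization: optional fixed constant, e.g. batch_size * seq_len
    """
    # Per-token negative log-likelihood, shape [batch, seq].
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none"
    )
    loss_mask = loss_mask.float()  # mask may be float-valued, not boolean
    masked = per_token * loss_mask

    if normalization is None:
        # Mask-dependent normalisation: the loss scale varies with how many
        # tokens happen to be unmasked in this microbatch.
        return masked.sum() / loss_mask.sum().clamp(min=1.0)

    # Microbatch-independent normalisation: a constant factor keeps each
    # token's contribution fixed regardless of the loss mask.
    return masked.sum() / normalization
```

As a usage note under the same assumptions, a caller would pass something like `normalization=batch_size * seq_len` so that short-context prefix-LM samples, where many positions are masked, are not implicitly up-weighted by a smaller mask sum.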