[PrefixLM] Figuring out why prefix LM is doing poorly on short contexts (#169)
* Loss normalisation should be invariant to the number of tokens trained on, i.e. it should not depend on the loss mask
* Make cross entropy use a microbatch-independent normalisation factor so that every token contributes equally (see the first sketch after this list)
* The loss mask is a float tensor, not a boolean one
* Make the change mergeable, i.e. it does not change the behaviour of GPT
* Allow running with loss_on_targets_only=False for prefix LM (#179) (see the second sketch below)
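
A minimal sketch of the microbatch-independent normalisation described above, in PyTorch. The function name `masked_cross_entropy` and the choice of constant denominator are illustrative assumptions, not the exact implementation in the PR:

```python
import torch
import torch.nn.functional as F

def masked_cross_entropy(logits, targets, loss_mask, normalisation):
    """Cross entropy over masked tokens with a microbatch-independent
    normalisation factor.

    logits:        [batch, seq, vocab]
    targets:       [batch, seq]
    loss_mask:     [batch, seq] float tensor (1.0 where loss is computed)
    normalisation: constant (e.g. batch * seq), so each token's weight
                   does not depend on how many tokens happen to be
                   unmasked in this particular microbatch.
    """
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        targets.view(-1),
        reduction="none",
    )
    # The loss mask is a float tensor, so multiply it in rather than
    # using it as a boolean index.
    masked = per_token * loss_mask.view(-1)
    # Dividing by a constant instead of loss_mask.sum() keeps the loss
    # invariant to the number of tokens trained on in this microbatch.
    return masked.sum() / normalisation
```

With the usual `masked.sum() / loss_mask.sum()`, a microbatch where most tokens are masked up-weights each surviving token; a constant denominator keeps every token's contribution the same across microbatches.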
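
A second sketch of how loss_on_targets_only plausibly shapes the prefix-LM loss mask; `prefix_lm_loss_mask` and its arguments are hypothetical names for illustration:

```python
import torch

def prefix_lm_loss_mask(seq_length, prefix_len, loss_on_targets_only):
    """Build the loss mask for a single prefix-LM sample.

    With loss_on_targets_only=True, loss is computed only on the target
    tokens after the prefix; with False, loss is also computed on the
    prefix tokens. Returned as a float tensor, matching how the mask is
    consumed in the cross-entropy sketch above.
    """
    mask = torch.ones(seq_length)
    if loss_on_targets_only:
        mask[:prefix_len] = 0.0  # no loss on the prefix/input tokens
    return mask

# Example: sequence of 8 tokens with a prefix of 3.
print(prefix_lm_loss_mask(8, 3, loss_on_targets_only=True))
# tensor([0., 0., 0., 1., 1., 1., 1., 1.])
print(prefix_lm_loss_mask(8, 3, loss_on_targets_only=False))
# tensor([1., 1., 1., 1., 1., 1., 1., 1.])
```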