flax
777ee543 - GPU Mixed precision training for WMT

Commit

4 years ago

GPU Mixed precision training for WMT

This also fixes NaN issues in the attention module when using float16. Masking is now implemented using a select instead of adding a large negative value. This avoids infinities and potential gradient leakage, which matters in particular for float16 because of its narrow range.

PiperOrigin-RevId: 395646244
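The difference between the two masking styles can be illustrated with a minimal JAX sketch. This is not the commit's actual diff: the helper names `additive_mask_logits` and `select_mask_logits`, the `-1e10` constant, and the toy inputs are assumptions chosen for illustration. It shows that an additive bias overflows to `-inf` once cast to float16 and still passes gradient to masked logits, while a select keeps values finite and blocks that gradient.

```python
import jax
import jax.numpy as jnp


def additive_mask_logits(logits, mask, big_neg=-1e10):
    # Additive style (hypothetical helper): add a large negative bias where
    # mask is False. -1e10 is outside the float16 range, so the cast gives -inf.
    bias = jnp.where(mask, 0.0, big_neg).astype(logits.dtype)
    return logits + bias


def select_mask_logits(logits, mask):
    # Select style (hypothetical helper): replace masked positions with the
    # most negative finite value representable in the logits' dtype.
    big_neg = jnp.finfo(logits.dtype).min
    return jnp.where(mask, logits, big_neg)


mask = jnp.array([True, True, False, False])

# float16 logits: the additive bias produces infinities, the select does not.
logits16 = jnp.array([1.0, 2.0, 3.0, 4.0], dtype=jnp.float16)
additive = additive_mask_logits(logits16, mask)
selected = select_mask_logits(logits16, mask)
probs = jax.nn.softmax(selected)

# Gradient of the summed masked logits with respect to the raw logits.
# The additive form passes gradient through every position; the select
# form passes none through masked positions.
logits32 = jnp.array([1.0, 2.0, 3.0, 4.0], dtype=jnp.float32)
grad_additive = jax.grad(lambda x: additive_mask_logits(x, mask).sum())(logits32)
grad_select = jax.grad(lambda x: select_mask_logits(x, mask).sum())(logits32)
```

In a real attention layer the masked logits feed a softmax, so the leaked gradient is scaled by near-zero weights rather than being exactly one; the sum here is used only to make the structural difference between the two forms visible.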

References

#1414 - GPU Mixed precision training for WMT

Author

jheek

Committer

a-googler

Parents

f92f936b
