onnxruntime
493159b4 - near-zero negative values must convert to 0 not NAN (#18473)

Commit

1 year ago

near-zero negative values must convert to 0 not NAN (#18473)

For the Float8 types with unsigned zero, we must clear the sign bit when rounding to zero; otherwise we end up with 0x80, which is the encoding for NAN.

### Description

Handle all zero and near-zero values the same way, rounding to positive zero.

Note that I removed one "if" level but did not re-indent the code in this PR, to make it easier to see what the actual changes are.

### Motivation and Context

For the two new 8-bit floating point types Float8E4M3FNUZ and Float8E5M2FNUZ, converting from a near-zero negative value would end up with the sign bit set only; this bit pattern is not negative zero but instead means NAN.
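The bit-pattern problem described in this commit message can be illustrated with a small Python sketch. This is not the onnxruntime code (the real conversion lives in the C++ Float8 types); the function names are hypothetical, and the encoders only model the case where the magnitude has already rounded to zero. The decoder assumes the Float8E4M3FNUZ layout of 1 sign bit, 4 exponent bits, 3 mantissa bits, and an exponent bias of 8.

```python
import math

NAN_FNUZ = 0x80  # sign bit only: NaN in the FNUZ formats, not negative zero


def decode_e4m3fnuz(byte: int) -> float:
    """Decode one Float8E4M3FNUZ byte (assumed layout: 1-4-3 bits, bias 8)."""
    if byte == NAN_FNUZ:
        return math.nan
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0:
        # Subnormal values: no implicit leading 1.
        return sign * (mantissa / 8.0) * 2.0 ** -7
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 8)


def underflow_keep_sign(value: float) -> int:
    """Hypothetical pre-fix behavior: the sign bit survives rounding to zero."""
    sign_bit = 0x80 if math.copysign(1.0, value) < 0 else 0x00
    return sign_bit  # exponent and mantissa bits are all zero


def underflow_clear_sign(value: float) -> int:
    """Hypothetical post-fix behavior: always round to positive zero."""
    return 0x00


tiny_negative = -1e-30  # far below the smallest representable magnitude
print(hex(underflow_keep_sign(tiny_negative)))   # sign bit only
print(decode_e4m3fnuz(underflow_keep_sign(tiny_negative)))
print(decode_e4m3fnuz(underflow_clear_sign(tiny_negative)))
```

Keeping the sign bit yields the byte 0x80, which decodes as NaN, while clearing it yields 0x00, which decodes as zero. The same reasoning should apply to Float8E5M2FNUZ, which also has a single unsigned zero.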

References

#18473 - near-zero negative values must convert to 0 not NAN

Author

arnej27959

Parents

605a84ff
