[X86] Lower `minimum`/`maximum`/`minimumnum`/`maximumnum` using bitwise operations (#170069)
I got somewhat nerd-sniped when looking at a Rust issue and seeing [this
comment about how various min/max operations are compiled on various
architectures](https://github.com/rust-lang/rust/issues/91079#issuecomment-3592393680).
The current `minimum`/`maximum`/`minimumnum`/`maximumnum` code is very
branchy because of the signed-zero handling. Even though we emit select
operations, LLVM *really* prefers to lower them to branches, to the
point of scalarizing vector code to do so, even if `blendv` is
supported. (Should I open a separate issue for that? It seems concerning
that LLVM would rather scalarize a vector operation than emit a couple
`blendv` operations in a row.)
It turns out that signed-zero operands can be handled correctly with a
couple of bitwise operations, which is branchless and easily
vectorizable, by taking advantage of the following properties:
- When you take the maximum of two floats, the output sign bit will be
the bitwise AND of the input sign bits (since 0 means positive, and the
maximum always prefers the positive number).
- When you take the minimum of two floats, the output sign bit will be
the bitwise OR of the input sign bits (since 1 means negative, and the
minimum always prefers the negative number).
We can further optimize this by taking advantage of the fact that x86's
min/max instructions operate like a floating-point compare+select,
returning the second operand when the operands compare equal; in
particular, when both are zero (of either sign).
Altogether, the operations go as follows:
- For taking the minimum:
- Call `minps`/`minpd`/etc. on the input operands. This will return the
minimum, unless both are zero, in which case it will return the second
operand.
- Take the bitwise AND of the first operand and a mask with only the
highest (sign) bit set, so that everything is cleared except the sign bit.
- Finally, OR that with the minimum from earlier. The only incorrect
case was when the second operand was +0.0 and the first operand was
-0.0. By OR-ing the first operand's sign bit with the existing minimum,
we correct this.
- Analogously, for taking the maximum:
- Call `maxps`/`maxpd`/etc. on the input operands. This will return the
maximum, unless both are zero, in which case it will return the second
operand.
- Take the bitwise OR of the first operand and a bit pattern that is
all ones except for the highest bit, so that everything is set except
the sign bit.
- Finally, AND that with the maximum from earlier.
In the case of NaNs, this approach might change the output NaN's sign
bit. We don't have to worry about this for a couple reasons: firstly,
LLVM's language reference [allows NaNs to have a nondeterministic sign
bit](https://llvm.org/docs/LangRef.html#floatnan); secondly, there's
already a step after this that selects one of the input NaNs anyway.
[Here's an Alive2 proof.](https://alive2.llvm.org/ce/z/EfQZ-G) It
obviously can't verify that the implementation is sound, but shows that
at least the theory is.
I believe this approach is faster than even properly-vectorized `blendv`
operations because it eliminates a data dependency chain. Furthermore, on
AVX-512, the load, AND, and OR can become a single `vpternlogd`. My
highly-unrepresentative microbenchmarks (compiled for x86-64-v2, so
SSE4.1) say ~7.5%-10% faster than `blendv`, which makes me confident
this is at least not a regression.