llvm-project
7d5d4db8 - [X86] Lower `minimum`/`maximum`/`minimumnum`/`maximumnum` using bitwise operations (#170069)

Commit
71 days ago
[X86] Lower `minimum`/`maximum`/`minimumnum`/`maximumnum` using bitwise operations (#170069)

I got somewhat nerd-sniped when looking at a Rust issue and seeing [this comment about how various min/max operations are compiled on various architectures](https://github.com/rust-lang/rust/issues/91079#issuecomment-3592393680).

The current `minimum`/`maximum`/`minimumnum`/`maximumnum` code is very branchy because of the signed-zero handling. Even though we emit select operations, LLVM *really* prefers to lower them to branches, to the point of scalarizing vector code to do so, even if `blendv` is supported. (Should I open a separate issue for that? It seems concerning that LLVM would rather scalarize a vector operation than emit a couple `blendv` operations in a row.)

It turns out that handling signed zero operands properly can be done using a couple bitwise operations, which is branchless and easily vectorizable, by taking advantage of the following properties:

- When you take the maximum of two floats, the output sign bit will be the bitwise AND of the input sign bits (since 0 means positive, and the maximum always prefers the positive number).
- When you take the minimum of two floats, the output sign bit will be the bitwise OR of the input sign bits (since 1 means negative, and the minimum always prefers the negative number).

We can further optimize this by taking advantage of the fact that x86's min/max instructions operate like a floating-point compare+select, returning the second operand if both are (positive or negative) zero.

Altogether, the operations go as follows:

- For taking the minimum:
  - Call `minps`/`minpd`/etc. on the input operands. This will return the minimum, unless both are zero, in which case it will return the second operand.
  - Take the bitwise AND of the first operand and the highest bit, so that everything is zero except the sign bit.
  - Finally, OR that with the minimum from earlier. The only incorrect case was when the second operand was +0.0 and the first operand was -0.0. By OR-ing the first operand's sign bit with the existing minimum, we correct this.
- Analogously, for taking the maximum:
  - Call `maxps`/`maxpd`/etc. on the input operands. This will return the maximum, unless both are zero, in which case it will return the second operand.
  - Take the bitwise OR of the first operand and a bit pattern which is all ones except for the highest bit, so that everything is ones except the sign bit.
  - Finally, AND that with the maximum from earlier.

In the case of NaNs, this approach might change the output NaN's sign bit. We don't have to worry about this for a couple reasons: firstly, LLVM's language reference [allows NaNs to have a nondeterministic sign bit](https://llvm.org/docs/LangRef.html#floatnan); secondly, there's already a step after this that selects one of the input NaNs anyway.

[Here's an Alive2 proof.](https://alive2.llvm.org/ce/z/EfQZ-G) It obviously can't verify that the implementation is sound, but shows that at least the theory is.

I believe this approach is faster than even properly-vectorized `blendv` operations because it eliminates a data dependency chain. Furthermore on AVX-512, the load, AND, and OR can become a single `vpternlogd`. My highly-unrepresentative microbenchmarks (compiled for x86-64-v2, so SSE4.1) say ~7.5%-10% faster than `blendv`, which makes me confident this is at least not a regression.
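The min/max steps described in the message can be sketched in scalar C++ as follows. This is only an illustration of the bit trick, not the code from the patch: every name here (`bits_of`, `float_of`, `x86_min_model`, `x86_max_model`, `minimum_sketch`, `maximum_sketch`, `self_check`) is hypothetical, `float` is assumed to be IEEE-754 binary32, the compare+select behavior of `minss`/`maxss` is modeled with a plain ternary, and the separate NaN-selection step mentioned above is left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "sketch assumes a 32-bit IEEE-754 float");

// Hypothetical helpers: reinterpret a float as its bit pattern and back.
static std::uint32_t bits_of(float x) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    return u;
}

static float float_of(std::uint32_t u) {
    float x;
    std::memcpy(&x, &u, sizeof x);
    return x;
}

constexpr std::uint32_t kSignBit = 0x80000000u;

// Model of the x86 compare+select semantics: when the comparison is false
// (for example, both operands are zeros of either sign), the SECOND operand
// is returned.
static float x86_min_model(float a, float b) { return a < b ? a : b; }
static float x86_max_model(float a, float b) { return a > b ? a : b; }

// Minimum: OR the first operand's sign bit into the min result.
// This repairs the one wrong case, a == -0.0 and b == +0.0.
float minimum_sketch(float a, float b) {
    std::uint32_t m = bits_of(x86_min_model(a, b));
    return float_of(m | (bits_of(a) & kSignBit));
}

// Maximum: AND the max result with (first operand | all-ones-except-sign).
// This clears the sign bit in the one wrong case, a == +0.0 and b == -0.0.
float maximum_sketch(float a, float b) {
    std::uint32_t m = bits_of(x86_max_model(a, b));
    return float_of(m & (bits_of(a) | ~kSignBit));
}

// Signed-zero cases in both operand orders (NaN inputs are not covered).
void self_check() {
    assert(std::signbit(minimum_sketch(-0.0f, 0.0f)));
    assert(std::signbit(minimum_sketch(0.0f, -0.0f)));
    assert(!std::signbit(maximum_sketch(-0.0f, 0.0f)));
    assert(!std::signbit(maximum_sketch(0.0f, -0.0f)));
}
```

Calling `self_check()` exercises the signed-zero inputs in both operand orders. For non-zero inputs the extra AND/OR leaves the min/max result unchanged, because whenever the second operand is selected its sign bit already agrees with what the fix would set.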