implement faster floating-point `isless` (#39090)
* implement faster floating-point `isless`
Previously `isless` relied on the C intrinsic `fpislt` in
`src/runtime_intrinsics.c`, while the new implementation in Julia
arguably generates better code, namely:
1. The NaN check compiles to a single instruction plus a branch that is
amenable to branch prediction in arguably most use cases (i.e. comparing
non-NaN floats), thus speeding up execution.
2. The compiler now often manages to remove the NaN check if the
surrounding code has already proven the arguments to be non-NaN.
3. The actual operation compares both arguments as sign-magnitude
integers instead of case analysis based on the sign of one
argument. This symmetric treatment may generate vectorized
instructions for the sign-magnitude conversion depending on how the
arguments are laid out.
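The three points above can be illustrated with a minimal sketch for
`Float64`. The names `signmag_key` and `isless_sketch` are hypothetical and
chosen for illustration only; this is an approximation of the idea under
those assumptions, not necessarily the exact code added to Base:

```julia
# Hypothetical helper: map a Float64 to an Int64 whose ordinary signed
# integer order matches the sign-magnitude order of the float's bits.
function signmag_key(x::Float64)
    ix = reinterpret(Int64, x)
    # Negative floats have the sign bit set, so `ix < 0`. Flipping the
    # lower 63 bits reverses their order: larger magnitudes sort lower.
    return ifelse(ix < 0, xor(ix, typemax(Int64)), ix)
end

# Hypothetical stand-in for `isless` on Float64.
function isless_sketch(a::Float64, b::Float64)
    # NaN handling first: NaN sorts after every other value, and NaN is
    # not less than NaN. This is the single, well-predicted branch.
    (isnan(a) || isnan(b)) && return !isnan(a)
    # Both arguments get the same treatment, with no case analysis on sign.
    return signmag_key(a) < signmag_key(b)
end
```

Because both arguments go through the same conversion, `-0.0` maps to `-1`
and `0.0` maps to `0`, so the sketch orders `-0.0` before `0.0` without a
special case.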
The actual behaviour of `isless` did not change and, apart from the
Julia-specific NaN handling (which may be up for debate), the resulting
total order corresponds to the IEEE 754 `totalOrder` predicate.
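For reference, a short illustration of the ordering that `isless` provides,
using only functions from Base. It shows where `isless` differs from `<`
(signed zeros and NaN):

```julia
# `<` follows IEEE comparison: signed zeros are equal, NaN is unordered.
-0.0 < 0.0            # false
1.0 < NaN             # false

# `isless` defines a total order: -0.0 sorts before 0.0, and NaN sorts
# after every other value, including Inf.
isless(-0.0, 0.0)     # true
isless(Inf, NaN)      # true
isless(NaN, 1.0)      # false
isless(NaN, NaN)      # false
```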
While the new implementation no longer generates fully branchless code, I
did not manage to construct a use case where this was detrimental: the
saved work seems to outweigh the potential cost of a branch
misprediction in all of my tests with various NaN-polluted data. Also,
auto-vectorization was not effective with the previous `fpislt` either.
Quick benchmarks (AMD A10-7860K) on `sort`, avoiding the specialized
algorithm:
```julia
a = rand(1000);
@btime sort($a, lt=(a,b)->isless(a,b));
# before: 56.030 μs (1 allocation: 7.94 KiB)
# after: 40.853 μs (1 allocation: 7.94 KiB)
a = rand(1000000);
@btime sort($a, lt=(a,b)->isless(a,b));
# before: 159.499 ms (2 allocations: 7.63 MiB)
# after: 120.536 ms (2 allocations: 7.63 MiB)
a = [rand((rand(), NaN)) for _ in 1:1000000];
@btime sort($a, lt=(a,b)->isless(a,b));
# before: 111.925 ms (2 allocations: 7.63 MiB)
# after: 77.669 ms (2 allocations: 7.63 MiB)
```
* Remove old intrinsic `fpislt` code
Co-authored-by: Mustafa Mohamad <mus-m@outlook.com>