llama.cpp
de562791 - vulkan: Optimize argsort (#15354)

Commit
54 days ago
vulkan: Optimize argsort (#15354) - Launch an appropriate number of invocations (next larger power of two). 32 invocations is common and the barrier is much cheaper there. - Specialize for "needs bounds checking" vs not. - Make the code less branchy and [[unroll]] the loops. In the final code, I see no branches inside the main loop (only predicated stores) when needs_bounds_check is false. - Always sort ascending, then apply the ascending vs descending option when doing the final stores to memory. - Copy the values into shared memory, makes them slightly cheaper to access.
Author
Parents
Loading