llama.cpp
077c94d0 - CUDA: add a fused top-K MoE kernel (#16130)

Commit

45 days ago

CUDA: add a fused top-K MoE kernel (#16130)

* CUDA: add a fused top-K MoE kernel

This kernel does the following:
1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce over the top-k (n_experts_used) logits
3. write weights + ids to global memory

It is intended as fusion of softmax->top-k->get_rows pipeline for MoE models

* Refactor into ggml_cuda_should_use_topk_moe

* Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before

* Review: format + micro-optimizations

* Fix bug: fix tie breakers

* Add optional norm + clean-up code

* Use smem for final write

* Add bounds check

* Use better memory pattern for writeback
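
The three steps listed in the commit message (softmax, repeated argmax for the top-k, write weights + ids) can be illustrated with a small CPU reference. This is only a sketch for checking results against, not the CUDA kernel from this commit. The names `TopKResult` and `topk_moe_reference` are hypothetical, the logits are assumed to be stored with the n_experts values of each token contiguous, the lowest-index-wins tie-break is an assumption (the commit only says tie breakers were fixed), and `normalize` stands in for the "optional norm" under the assumption that it rescales the selected weights to sum to 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference; all names here are illustrative and are not
// taken from llama.cpp.
struct TopKResult {
    std::vector<float> weights; // [n_tokens * k], selected softmax weights
    std::vector<int>   ids;     // [n_tokens * k], selected expert indices
};

// logits: n_tokens rows of n_experts values (assumed contiguous per token).
// k:      number of experts to select per token (n_experts_used).
TopKResult topk_moe_reference(const std::vector<float>& logits,
                              int n_experts, int n_tokens, int k,
                              bool normalize) {
    assert(k > 0 && k <= n_experts);
    assert(logits.size() == static_cast<std::size_t>(n_experts) * n_tokens);

    TopKResult out;
    out.weights.resize(static_cast<std::size_t>(n_tokens) * k);
    out.ids.resize(static_cast<std::size_t>(n_tokens) * k);

    std::vector<float> probs(n_experts);
    for (int t = 0; t < n_tokens; ++t) {
        const float* row = logits.data() + static_cast<std::size_t>(t) * n_experts;

        // Step 1: numerically stable softmax over this token's logits.
        const float max_val = *std::max_element(row, row + n_experts);
        float sum = 0.0f;
        for (int e = 0; e < n_experts; ++e) {
            probs[e] = std::exp(row[e] - max_val);
            sum += probs[e];
        }
        for (int e = 0; e < n_experts; ++e) {
            probs[e] /= sum;
        }

        // Step 2: k rounds of argmax. The strict '>' keeps the lowest index
        // on ties (an assumed tie-break rule).
        float selected_sum = 0.0f;
        for (int i = 0; i < k; ++i) {
            int best = -1;
            for (int e = 0; e < n_experts; ++e) {
                if (best < 0 || probs[e] > probs[best]) {
                    best = e;
                }
            }
            // Step 3: write the weight and the expert id for this slot.
            out.ids[static_cast<std::size_t>(t) * k + i]     = best;
            out.weights[static_cast<std::size_t>(t) * k + i] = probs[best];
            selected_sum += probs[best];
            probs[best] = -1.0f; // mark as taken; valid probabilities are >= 0
        }

        // Optional: rescale the k selected weights so they sum to 1.
        if (normalize) {
            for (int i = 0; i < k; ++i) {
                out.weights[static_cast<std::size_t>(t) * k + i] /= selected_sum;
            }
        }
    }
    return out;
}
```

Calling `topk_moe_reference(logits, n_experts, n_tokens, k, false)` returns the raw softmax weights of the selected experts; passing `true` rescales them per token. A fused GPU kernel would typically do the same work per token within one warp instead of three separate passes over global memory.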

References

#16130 - CUDA: add a fused top-K MoE kernel

Author

am17an

Parents

aa3ee0eb
