llama.cpp
CUDA: add a fused top-K MoE kernel
#16130
Merged

Commits
  • CUDA: add a fused top-K MoE kernel
    am17an committed 46 days ago
  • Refactor into ggml_cuda_should_use_topk_moe
    am17an committed 46 days ago
  • Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before
    am17an committed 46 days ago
  • Review: format + micro-optimizations
    am17an committed 46 days ago
  • Fix bug: fix tie breakers
    am17an committed 46 days ago
  • Add optional norm + clean-up code
    am17an committed 46 days ago
  • Use smem for final write
    am17an committed 46 days ago
  • Add bounds check
    am17an committed 45 days ago
  • Use better memory pattern for writeback
    am17an committed 45 days ago