llama.cpp
CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case
#12183
Merged

Commits
  • CUDA: determine FA parallel blocks at runtime
    gaugarg-nv committed 283 days ago
  • CUDA: Improve flash decoding kernel occupancy for BS=1 case
    gaugarg-nv committed 283 days ago
  • consider tail effects for parallel_blocks
    JohannesGaessler committed 283 days ago