llama.cpp
sampling : add support for backend sampling
#17004
Merged
ggerganov merged 179 commits into ggml-org:master from danbev:gpu-sampling
github-actions
added
testing
danbev
force pushed
73 days ago
danbev
force pushed
72 days ago
github-actions
added
Nvidia GPU
github-actions
added
ggml
danbev
force pushed
72 days ago
danbev
force pushed
72 days ago
danbev
force pushed
71 days ago
github-actions
added
examples
github-actions
added
server
danbev
force pushed
70 days ago
danbev
force pushed
69 days ago
danbev
force pushed
68 days ago
danbev
force pushed
67 days ago
danbev
force pushed
67 days ago
danbev
force pushed
67 days ago
danbev
force pushed
67 days ago
slaren
commented on 2025-11-11
danbev
force pushed
66 days ago
danbev
force pushed
66 days ago
danbev
force pushed
66 days ago
danbev
force pushed
66 days ago
ORippler
commented on 2025-11-12
danbev
force pushed
65 days ago
danbev
force pushed
64 days ago
danbev
force pushed
62 days ago
danbev
force pushed
62 days ago
danbev
force pushed
62 days ago
danbev
force pushed
62 days ago
github-actions
added
Apple Metal
danbev
force pushed
61 days ago
ggerganov
commented on 2025-11-17
ggerganov
commented on 2025-11-17
danbev
force pushed
61 days ago
danbev
force pushed
61 days ago
danbev
changed the title
sampling : add support for GPU sampling (wip)
sampling : add support for backend sampling (wip)
61 days ago
ggerganov
commented on 2025-11-17
sampling : add support for backend sampling
7884b0e0
llama-cli : add backend sampler configuration
9fe9a00a
server : add backend sampling options/configuration
f1f3e685
webui : add backend sampling options
a3eb847d
ggml : add initial cumsum implementation for CUDA
67d3b8e8
danbev
force pushed
to
67d3b8e8
60 days ago
danbev
changed the title
sampling : add support for backend sampling (wip)
sampling : add support for backend sampling
60 days ago
sampling : enable all backend sampler tests
71574f92
danbev
marked this pull request as ready for review
60 days ago
danbev
requested a review
from
allozaur
60 days ago
danbev
requested a review
from
ngxson
60 days ago
danbev
requested a review
from
CISC
60 days ago
graph : do not include llama-model.h
4b52e599
ggerganov
commented on 2025-11-18
sampling : always expose sampled_ids
82957a90
sampling : ensure at most one output token per seq
311c1a34
CUDA: Optimize argsort for gpu-based token sampling
26be108b
sampling : remove version from sampler chain
0da7e7dc
sampling : always populate logits for sampled probs
51fee298
sampling : simplify backend sampling logic decode
7e98ebcc
squash! sampling : simplify backend sampling logic decode
d74eb61a
common : fix regression caused by extra memory allocations during sam…
38f408c2
squash! sampling : simplify backend sampling logic decode
18ed4d8f
Merge remote-tracking branch 'upstream/master' into backend-sampling
0c660e73
squash! common : fix regression caused by extra memory allocations du…
ed4345bd
ORippler
commented on 2025-11-20
sampling : introduce sampling_info struct
0d28b16b
sampling : return early if backend sampling is disabled
c1625620
sampling : use pinned memory for backend sampling buffers
61ffe41d
common, tools : refactor model loading to support backend samplers
9b243934
Merge remote-tracking branch 'upstream/master' into backend-sampling
79b8cf2a
sampling : add stride variable for clarity
65500d05
sampling: clarify candidate ids usage in comments
ae23d2d2
sampling : fix copying both sampled tokens and logits/probs from backend
9e273f7a
tests : cleanup test-backend-sampler.cpp
50d21aa4
Merge remote-tracking branch 'upstream/master' into backend-sampling
7816f0bb
common : remove build-info.cpp from commit [no ci]
d88ba181
sampling : cleanup and clarify output_reserve
4a90583d
sampling : remove redundant checks for stride and size [no ci]
8eb9b476
sampling : add debug log when backend sampler selects token
25f33806
examples : update batched to use backend sampling
d0bea21a
llama-cli : fix dangling reference to sampler config
e2d4f082
common : initialize backend samplers
b26c7069
samplers : add missing cont
883a8704
sampling : add assertions for contiguous tensors in async copy functions
a02adf42
Merge remote-tracking branch 'upstream/master' into backend-sampling
2b4c7927
examples : add info about hybrid sampling in batched [no ci]
0f17ccde
Merge remote-tracking branch 'upstream/master' into gpu-sampling
53dca56d
ggerganov
commented on 2025-11-25
sampling : remove backend-dist option (wip)
9e5e09d0
Merge remote-tracking branch 'upstream/master' into backend-sampling
ec047e12
CUDA: Add top-k implementation
f23b306c
sampling : add min-p backend sampler
b45d504e
github-actions
added
build
Use `FetchContent` over CPM as it's bundled with CMake
4fea191c
common : add get_active_samplers function to check enabled samplers
0f7805f3
ORippler
commented on 2025-11-26
cuda : fix editorconfig-checker warning
90a3aff2
Merge remote-tracking branch 'upstream/master' into backend-sampling
7c2bfb35
sampling : use argmax for min-p sampling
d9d73610
sampling : fix temperature check to allow zero temperature
51107a0b
cuda : fix top-k compilation when CUB is unavailable
5ea3be26
sampling : add comments about backend sampler [no ci]
172208af
sampling : remove backend sampling chain from common_sampler
e9d07098
Fix top-k comp & behavior for non-CUB path
f9889cf1
sampling : support intermixed backend/cpu samplers
74be332e
squash! sampling : support intermixed backend/cpu samplers
9ad6522b
squash! sampling : support intermixed backend/cpu samplers
459b7ae7
refactor : simplify and improve memory management
117e2079
ggerganov
requested a review
from
JohannesGaessler
49 days ago
Add initial version for top-p sampling
333da805
ORippler
commented on 2025-11-28
sampling : use logits directly for min-p filtering
8cac9dee
sampling : simplify
2464d1b3
llama : simplify
fbc8f49f
llama : cleanup + naming
9028ebfe
Merge branch 'master' into HEAD
d8d98bb4
llama : call backend_init once
ff7b0bf6
Merge branch 'master' into HEAD
467746e3
llama : reserve graphs with samplers
1760bd69
llama : naming
c187003d
cont : naming
80742cba
sampling : lower log level for output buffer reallocations [no ci]
cf0e1475
Fix backend_top_p_sampler
8bee483c
Merge branch 'master' into HEAD
16451d6b
Factor out `ggml_sort` into its own function
ae0bb6a6
Make backend's top_p sampler inclusive
217469f0
common : simplify sampler chain initialization
4032ce23
sampling : do not create empty samplers
04f2822a
sampling : fix top_p empty condition
88cca45b
examples : remove outdated backend sampling section
988261b1
sampling : fix backend temp sampler for zero temperature
739b5978
Merge remote-tracking branch 'upstream/master' into gpu-sampling
3e9a258c
ggerganov
commented on 2025-12-02
CUDA: Move cccl fetch to after cuda has been enabled in CMakeLists.txt
559d058d
CUDA: Use standard-compliant preprocessor for MSVC builds
244880ae
CUDA: Update CCCL's rc candidate
516af33c
squash! sampling : fix backend temp sampler for zero temperature
db8972e2
Merge remote-tracking branch 'upstream/master' into backend-sampling
2595818a
sampling : implement temp_ext_backend sampling
aad5a6af
sampling : minor cleanup
cce3b2a8
sampling : stop short if backend sampler sampled a token
87b2719e
Merge remote-tracking branch 'upstream/master' into backend-sampling
c0b182f4
Revert "sampling : stop short if backend sampler sampled a token"
10bd640a
sampling : fix backend temp sampling to use logits masking
ac9e1647
sampling : simplify temp sampling
fce571ee
sampling : remove redundant calls to ggml_build_forward_expand
1bde7078
sampling : check backend support during init
6958d413
cont : keep backend sampling disabled for now
abc19635
sampling : fix outputs and device checks
7864074f
allozaur
approved these changes on 2025-12-05
sampling : fix candidates logic
cf74b1a8
Add perf-tests for CUMSUM
dd11f6eb
Merge branch 'master' into gpu-sampling
76689995
Readd `cub::DeviceScan::InclusiveSum`-based CumSum
e6525661
sampling : expand support (wip)
30742a6f
Merge branch 'master' into HEAD
fdac9686
tests : fix memory leaks
52258181
github-actions
added
python
cont : fixes
8ef5f900
tests : check temp back to 0.0
42125f0e
sampling : fix top-p
72e36810
Merge branch 'master' into HEAD
6d38db5d
sampling : handle n_probs case
f3beb22b
server : handle unsupported cases
560ac16f
metal : print node names for debugging
d62b5804
ggml : remove redundant src in ggml_cast
62d1b008
ggml-alloc : fix reuse-parent logic for misaligned sizes
9f6681c3
Revert "ggml : remove redundant src in ggml_cast"
7ab6f51b
CUDA: Add Cooperative-Groups-based parallelization of ncols in softmax
a84dfd3e
Add TODOs to and adjust heuristics of row-wise soft_max in CUDA
886c3668
Fix compiler warnings by casting `const` away
07003f1f
llama : require backend samplers to be of type llama_sampler_chain
92ff7679
sampling : use host buffer type for inputs
34b407b4
Try fixing HIP build errors by adding corresponding #defines
3f0594ad
Fix launch logic when supports_cooperative_launch=false
a25fda52
Disable cooperative groups for musa
6dc6614b
Merge branch 'master' into HEAD
81cb5783
server : reconnect the backend_sampling setting in the WebUI
0ecee8be
graph : make the compute graph constant with respect to active samplers
c02654eb
Merge branch 'master' into HEAD
38882247
JohannesGaessler
commented on 2025-12-10
batch : fix sequence id ownage
44d5c4b5
graph : respect sampler order for graph reuse
804e7e37
HIP/MUSA: fix build for backend sampling
42cf5c01
Merge pull request #1 from JohannesGaessler/gpu-sampling-hip
56720f8f
sampling : optimize logit_bias sampler
54e90540
cont : fix build
d5d16651
sampling : generic ggml op support detection
8544aba3
sampling : fix greedy
74b112e3
tests : run backend sampler tests always on the CPU
ab65b47a
Merge branch 'master' into HEAD
4d10b78e
Apply suggestions from code review
07b809bb
Merge branch 'master' into HEAD
22c7f85b
Merge branch 'master' into HEAD
0086c246
webui : fix lint
2652e745
Fix data-race in `soft_max_f32_parallelize_cols_single_row`
3732b85b
Apply automated code-formating to softmax.cu
e5737f66
Merge remote-tracking branch 'upstream/master' into backend-sampling
ad1b60ab
llama : clarify backend_accept/backend_set_input comments [no ci]
68a1c4dc
llama : fix typo in comment [no ci]
c5d44b85
tests : use smart pointers for backend samplers
9a9ea2f6
tests : use smart pointers for model and context
98459969
tests : remove vocab member from test_model_context
76a1b7fe
tests : extract batch info update to separate method
cc31e6a2
tests : fix batch token position tracking in test_backend_sampler.cpp
a519aea3
tests : add --device option support to backend sampler tests
981475fe
Merge branch 'master' into HEAD
eefdb0da
common : disable backend sampling when grammar is involved
3b3f5fed
Merge remote-tracking branch 'upstream/master' into backend-sampling
bc5195c5
Fix different RNG-states between backend-sampling and llama-sampling
17509174
Make backend dist sampler use same rnd's as dist sampler
0a17687c
Update CCCL version to v3.2.0-rc2
b5ec0fd7
Build with CCCL 3.2 for CUDA backends
1da013c6
github-actions
added
devops
Merge remote-tracking branch 'upstream/master' into backend-sampling
f1310ab9
Merge branch 'master' into HEAD
0ce03597
tests : revert server test changes (no longer needed)
c0a351cc
Merge remote-tracking branch 'upstream/master' into backend-sampling
82c26005
ggml : include cub/cub.cuh instead of block_scan.cuh
060c0a58
Merge remote-tracking branch 'upstream/master' into backend-sampling
ebfe545c
arg : add shorthand for --backend-sampling
23e8bb40
ci : add server workflow with backend sampling
5d2156e8
sampling : fix reshapes
610e50a1
server : remove printfs
588299c2
Merge branch 'master' into HEAD
c5de7598
sampling : zero-initialize input buffers
791ecb94
minor : add comments + some cleanup
4c3d5422
llama : assert at most one output token per sequence
435c9670
tests : add more top_k tests
0d85c5ca
Merge branch 'master' into HEAD
8071a57c
CUDA: Fix non-determinism of CUB-based Top-K
b3cf4eb1
CUDA: Optimize index of top_k_cub
6975bda9
Apply code-formatting to top-k.cu
194401af
Merge remote-tracking branch 'origin/master' into gpu-sampling
9f6c1f33
CUDA: Remove obsolete temp_keys from CUB
03454de7
minor : cleanup, TODOs, etc.
2e54b1db
ggerganov merged d3dce4e0 into master 12 days ago
Reviewers: allozaur, JohannesGaessler, ggerganov, slaren, ORippler, ngxson, CISC
Assignees: No one assigned
Labels: build, testing, Nvidia GPU, examples, python, devops, server, ggml, Apple Metal
Milestone: No milestone