PR #17004 sampling : add support for backend sampling

github-actions added testing

danbev force pushed 73 days ago

danbev force pushed 72 days ago

github-actions added Nvidia GPU

github-actions added ggml

danbev force pushed 72 days ago

danbev force pushed 71 days ago

github-actions added examples

github-actions added server

danbev force pushed 70 days ago

danbev force pushed 69 days ago

danbev force pushed 68 days ago

danbev force pushed 67 days ago

slaren commented on 2025-11-11

danbev force pushed 66 days ago

ORippler commented on 2025-11-12

danbev force pushed 65 days ago

danbev force pushed 64 days ago

danbev force pushed 62 days ago

github-actions added Apple Metal

danbev force pushed 61 days ago

ggerganov commented on 2025-11-17

danbev force pushed 61 days ago

danbev changed the title ~~sampling : add support for GPU sampling (wip)~~ sampling : add support for backend sampling (wip) 61 days ago

ggerganov commented on 2025-11-17

sampling : add support for backend sampling

7884b0e0

llama-cli : add backend sampler configuration

9fe9a00a

server : add backend sampling options/configuration

f1f3e685

webui : add backend sampling options

a3eb847d

ggml : add initial cumsum implementation for CUDA

67d3b8e8

danbev force pushed to 67d3b8e8 60 days ago

danbev changed the title ~~sampling : add support for backend sampling (wip)~~ sampling : add support for backend sampling 60 days ago

sampling : enable all backend sampler tests

71574f92

danbev marked this pull request as ready for review 60 days ago

danbev requested a review from allozaur 60 days ago

danbev requested a review from ngxson 60 days ago

danbev requested a review from CISC 60 days ago

graph : do not include llama-model.h

4b52e599

ggerganov commented on 2025-11-18

sampling : always expose sampled_ids

82957a90

sampling : ensure at most one output token per seq

311c1a34

CUDA: Optimize argsort for gpu-based token sampling

26be108b

sampling : remove version from sampler chain

0da7e7dc

sampling : always populate logits for sampled probs

51fee298

sampling : simplify backend sampling logic decode

7e98ebcc

squash! sampling : simplify backend sampling logic decode

d74eb61a

common : fix regression caused by extra memory allocations during sam…

38f408c2

squash! sampling : simplify backend sampling logic decode

18ed4d8f

Merge remote-tracking branch 'upstream/master' into backend-sampling

0c660e73

squash! common : fix regression caused by extra memory allocations du…

ed4345bd

ORippler commented on 2025-11-20

sampling : introduce sampling_info struct

0d28b16b

sampling : return early if backend sampling is disabled

c1625620

sampling : use pinned memory for backend sampling buffers

61ffe41d

common, tools : refactor model loading to support backend samplers

9b243934

Merge remote-tracking branch 'upstream/master' into backend-sampling

79b8cf2a

sampling : add stride variable for clarity

65500d05

sampling: clarify candidate ids usage in comments

ae23d2d2

sampling : fix copying both sampled tokens and logits/probs from backend

9e273f7a

tests : cleanup test-backend-sampler.cpp

50d21aa4

Merge remote-tracking branch 'upstream/master' into backend-sampling

7816f0bb

common : remove build-info.cpp from commit [no ci]

d88ba181

sampling : cleanup and clarify output_reserve

4a90583d

sampling : remove redundant checks for stride and size [no ci]

8eb9b476

sampling : add debug log when backend sampler selects token

25f33806

examples : update batched to use backend sampling

d0bea21a

llama-cli : fix dangling reference to sampler config

e2d4f082

common : initialize backend samplers

b26c7069

samplers : add missing cont

883a8704

sampling : add assertions for contiguous tensors in async copy functions

a02adf42

Merge remote-tracking branch 'upstream/master' into backend-sampling

2b4c7927

examples : add info about hybrid sampling in batched [no ci]

0f17ccde

Merge remote-tracking branch 'upstream/master' into gpu-sampling

53dca56d

ggerganov commented on 2025-11-25

sampling : remove backend-dist option (wip)

9e5e09d0

Merge remote-tracking branch 'upstream/master' into backend-sampling

ec047e12

CUDA: Add top-k implementation

f23b306c

sampling : add min-p backend sampler

b45d504e

github-actions added build

Use `FetchContent` over CPM as it's bundled with CMake

4fea191c

common : add get_active_samplers function to check enabled samplers

0f7805f3

ORippler commented on 2025-11-26

cuda : fix editorconfig-checker warning

90a3aff2

Merge remote-tracking branch 'upstream/master' into backend-sampling

7c2bfb35

sampling : use argmax for min-p sampling

d9d73610

sampling : fix temperature check to allow zero temperature

51107a0b

cuda : fix top-k compilation when CUB is unavailable

5ea3be26

sampling : add comments about backend sampler [no ci]

172208af

sampling : remove backend sampling chain from common_sampler

e9d07098

Fix top-k comp & behavior for non-CUB path

f9889cf1

sampling : support intermixed backend/cpu samplers

74be332e

squash! sampling : support intermixed backend/cpu samplers

9ad6522b

squash! sampling : support intermixed backend/cpu samplers

459b7ae7

refactor : simplify and improve memory management

117e2079

ggerganov requested a review from JohannesGaessler 49 days ago

Add initial version for top-p sampling

333da805

ORippler commented on 2025-11-28

sampling : use logits directly for min-p filtering

8cac9dee

sampling : simplify

2464d1b3

llama : simplify

fbc8f49f

llama : cleanup + naming

9028ebfe

Merge branch 'master' into HEAD

d8d98bb4

llama : call backend_init once

ff7b0bf6

Merge branch 'master' into HEAD

467746e3

llama : reserve graphs with samplers

1760bd69

llama : naming

c187003d

cont : naming

80742cba

sampling : lower log level for output buffer reallocations [no ci]

cf0e1475

Fix backend_top_p_sampler

8bee483c

Merge branch 'master' into HEAD

16451d6b

Factor out `ggml_sort` into its own function

ae0bb6a6

Make backend's top_p sampler inclusive

217469f0

common : simplify sampler chain initialization

4032ce23

sampling : do not create empty samplers

04f2822a

sampling : fix top_p empty condition

88cca45b

examples : remove outdated backend sampling section

988261b1

sampling : fix backend temp sampler for zero temperature

739b5978

Merge remote-tracking branch 'upstream/master' into gpu-sampling

3e9a258c

ggerganov commented on 2025-12-02

CUDA: Move cccl fetch to after cuda has been enabled in CMakeLists.txt

559d058d

CUDA: Use standard-compliant preprocessor for MSVC builds

244880ae

CUDA: Update CCCL's rc candidate

516af33c

squash! sampling : fix backend temp sampler for zero temperature

db8972e2

Merge remote-tracking branch 'upstream/master' into backend-sampling

2595818a

sampling : implement temp_ext_backend sampling

aad5a6af

sampling : minor cleanup

cce3b2a8

sampling : stop short if backend sampler sampled a token

87b2719e

Merge remote-tracking branch 'upstream/master' into backend-sampling

c0b182f4

Revert "sampling : stop short if backend sampler sampled a token"

10bd640a

sampling : fix backend temp sampling to use logits masking

ac9e1647

sampling : simplify temp sampling

fce571ee

sampling : remove redundant calls to ggml_build_forward_expand

1bde7078

sampling : check backend support during init

6958d413

cont : keep backend sampling disabled for now

abc19635

sampling : fix outputs and device checks

7864074f

allozaur approved these changes on 2025-12-05

sampling : fix candidates logic

cf74b1a8

Add perf-tests for CUMSUM

dd11f6eb

Merge branch 'master' into gpu-sampling

76689995

Readd `cub::DeviceScan::InclusiveSum`-based CumSum

e6525661

sampling : expand support (wip)

30742a6f

Merge branch 'master' into HEAD

fdac9686

tests : fix memory leaks

52258181

github-actions added python

cont : fixes

8ef5f900

tests : check temp back to 0.0

42125f0e

sampling : fix top-p

72e36810

Merge branch 'master' into HEAD

6d38db5d

sampling : handle n_probs case

f3beb22b

server : handle unsupported cases

560ac16f

metal : print node names for debugging

d62b5804

ggml : remove redundant src in ggml_cast

62d1b008

ggml-alloc : fix reuse-parent logic for misaligned sizes

9f6681c3

Revert "ggml : remove redundant src in ggml_cast"

7ab6f51b

CUDA: Add Cooperative-Groups-based parallelization of ncols in softmax

a84dfd3e

Add TODOs to and adjust heuristics of row-wise soft_max in CUDA

886c3668

Fix compiler warnings by casting `const` away

07003f1f

llama : require backend samplers to be of type llama_sampler_chain

92ff7679

sampling : use host buffer type for inputs

34b407b4

Try fixing HIP build errors by adding corresponding #defines

3f0594ad

Fix launch logic when supports_cooperative_launch=false

a25fda52

Disable cooperative groups for musa

6dc6614b

Merge branch 'master' into HEAD

81cb5783

server : reconnect the backend_sampling setting in the WebUI

0ecee8be

graph : make the compute graph constant with respect to active samplers

c02654eb

Merge branch 'master' into HEAD

38882247

JohannesGaessler commented on 2025-12-10

batch : fix sequence id ownage

44d5c4b5

graph : respect sampler order for graph reuse

804e7e37

HIP/MUSA: fix build for backend sampling

42cf5c01

Merge pull request #1 from JohannesGaessler/gpu-sampling-hip

56720f8f

sampling : optimize logit_bias sampler

54e90540

cont : fix build

d5d16651

sampling : generic ggml op support detection

8544aba3

sampling : fix greedy

74b112e3

tests : run backend sampler tests always on the CPU

ab65b47a

Merge branch 'master' into HEAD

4d10b78e

Apply suggestions from code review

07b809bb

Merge branch 'master' into HEAD

22c7f85b

Merge branch 'master' into HEAD

0086c246

webui : fix lint

2652e745

Fix data-race in `soft_max_f32_parallelize_cols_single_row`

3732b85b

Apply automated code-formating to softmax.cu

e5737f66

Merge remote-tracking branch 'upstream/master' into backend-sampling

ad1b60ab

llama : clarify backend_accept/backend_set_input comments [no ci]

68a1c4dc

llama : fix typo in comment [no ci]

c5d44b85

tests : use smart pointers for backend samplers

9a9ea2f6

tests : use smart pointers for model and context

98459969

tests : remove vocab member from test_model_context

76a1b7fe

tests : extract batch info update to separate method

cc31e6a2

tests : fix batch token position tracking in test_backend_sampler.cpp

a519aea3

tests : add --device option support to backend sampler tests

981475fe

Merge branch 'master' into HEAD

eefdb0da

common : disable backend sampling when grammar is involved

3b3f5fed

Merge remote-tracking branch 'upstream/master' into backend-sampling

bc5195c5

Fix different RNG-states between backend-sampling and llama-sampling

17509174

Make backend dist sampler use same rnd's as dist sampler

0a17687c

Update CCCL version to v3.2.0-rc2

b5ec0fd7

Build with CCCL 3.2 for CUDA backends

1da013c6

github-actions added devops

Merge remote-tracking branch 'upstream/master' into backend-sampling

f1310ab9

Merge branch 'master' into HEAD

0ce03597

tests : revert server test changes (no longer needed)

c0a351cc

Merge remote-tracking branch 'upstream/master' into backend-sampling

82c26005

ggml : include cub/cub.cuh instead of block_scan.cuh

060c0a58

Merge remote-tracking branch 'upstream/master' into backend-sampling

ebfe545c

arg : add shorthand for --backend-sampling

23e8bb40

ci : add server workflow with backend sampling

5d2156e8

sampling : fix reshapes

610e50a1

server : remove printfs

588299c2

Merge branch 'master' into HEAD

c5de7598

sampling : zero-initialize input buffers

791ecb94

minor : add comments + some cleanup

4c3d5422

llama : assert at most one output token per sequence

435c9670

tests : add more top_k tests

0d85c5ca

Merge branch 'master' into HEAD

8071a57c

CUDA: Fix non-determinism of CUB-based Top-K

b3cf4eb1

CUDA: Optimize index of top_k_cub

6975bda9

Apply code-formatting to top-k.cu

194401af

Merge remote-tracking branch 'origin/master' into gpu-sampling

9f6c1f33

CUDA: Remove obsolete temp_keys from CUB

03454de7

minor : cleanup, TODOs, etc.

2e54b1db

ggerganov merged d3dce4e0 into master 12 days ago
