PR #21152 examples : add llama-eval

ggerganov merged 66 commits into master from gg/scripts-eval

github-actions added the examples and python labels

working llama-eval mc and math suite

db8b09d6

multi source llama-eval

4db4497c

Add readme

c7f3ce25

add checkpointing

5cbe95b6

examples: add llama-server simulator for testing eval scripts

58bd57ba

examples: refactor test-simulator.sh for better readability

05b8425b

docs: update llama-eval-discussion.md with session work summary

deed0786

examples: add simplified llama-eval-new.py for AIME evaluation

a2b96e04

docs: remove README.md from llama-eval

de8eda46

examples: implement flexible grader system for answer validation

0ca458d8

examples: use HF_HUB_OFFLINE to avoid HF Hub warnings

30ea5124

examples: remove HF_HUB_OFFLINE to allow dataset download

d7d2c229

examples: use cached dataset path to avoid HF Hub requests

edc766c9

examples: use cached dataset path in simulator to avoid HF Hub requests

3732aea2

docs: update llama-eval-discussion.md with session work summary

2fe445cc

examples: add threading support and model parameter to llama-eval-new.py

fb40d1a0

docs: update llama-eval-discussion.md with threading and model parame…

d639ee52

examples: add task summary table to llama-eval-new.py

ee9b715e

eval : print progress

940364e4

eval : add prompts

1a780f7c

test : fix path

64720e1e

sim : fix answer matching

cda8cae0

eval : support multiple dataset runs

530f38f9

minor

9578e83a

improve grader

4f176f6a

docs

65e3c5a9

remove old files

abec77e0

datasets : add gsm8k

55ce1b4e

add gpqa + sampling + docs

e7b86460

rename

9f02fa63

grader : improve example answers

6e7e1a5a

cont

55a7cf4a

datasets : add aime2025

f99d77f3

grader : update prompt

8b94ab4f

grade : improve regex + logs

122dfe3e

datasets : fix aime2025

f20b5a72

cleanup

91bd92c6

add AGENTS.md

802d85e2

ignore errors

f35b10f0

resume eval

d830acac

cleanup

095c8ab6

fix counts

f95f4dd1

simplify

2e0b6766

fix prompts

7e8c88c5

add html

36497938

store full response

6797d80d

add tokens

fc571f3a

reasoning and error handling

752b703a

refactor

bad9565a

track total time

e0a2cf48

remove junk

633a68d6

eval : unify "judge" terminology to "grader"

7d433f76

eval : add Wilson score confidence interval to results

81a65cf0

ggerganov force pushed from 1c128d94 to 81a65cf0 17 days ago

llama-eval : add per-task generation speed from server timings

4d5dedc5

llama-eval : add per-task generation time from server timings

9f10d8d1

llama-eval : rename display, escaped, and count variables to use pref…

d26b1ffc

llama-eval : support multiple evaluation endpoints with dynamic task …

43f14a0a

llama-server-simulator : replace Flask with stdlib http.server

f64d56bc

llama-eval : update README with PR link and quick-start examples

094554db

ggerganov marked this pull request as ready for review 16 days ago

ggerganov requested a review from copilot-pull-request-reviewer 16 days ago

copilot-pull-request-reviewer commented on 2026-05-10

llama-eval : track model name in eval state and verify on resume

e5ac6d1d

llama-server-simulator : fix comment - Dice coefficient, not Levenshtein

85c6aa00

llama-eval : require --grader-model or --model when using --grader-ty…

d5165e8f

llama-eval : protect dump() with lock for thread safety

f49c636d

llama-eval : compact HTML report output

eda7b07d

llama-eval : check server connectivity on startup

56465e96

llama-eval : use server1/server2 instead of gpu1/gpu2 in README

f634472a

ggerganov merged fde69a36 into master 15 days ago

ggerganov deleted the gg/scripts-eval branch 15 days ago
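
Note on commit 81a65cf0 ("eval : add Wilson score confidence interval to results"): the timeline does not show how the interval is computed in llama-eval, so the following is only a minimal sketch of the standard Wilson score formula for a pass rate. The function name `wilson_interval`, its parameters, and the default z = 1.96 (roughly a 95% interval) are assumptions for illustration and are not taken from the PR.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (hypothetical helper).

    correct: number of tasks graded as correct
    total:   number of tasks evaluated
    z:       normal quantile; 1.96 is approximately a 95% interval (assumed default)
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")

    p = correct / total                      # observed accuracy
    z2 = z * z
    denom = 1.0 + z2 / total
    center = (p + z2 / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total + z2 / (4.0 * total * total)) / denom

    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    lo, hi = wilson_interval(5, 10)
    print(f"accuracy 5/10, interval [{lo:.3f}, {hi:.3f}]")
```

Unlike the plain normal approximation, this interval stays inside [0, 1] and remains informative when the score is 0% or 100%, which is common for small evaluation sets.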

Reviewers

copilot-pull-request-reviewer

Assignees

No one assigned

Labels

examples, python

Milestone

No milestone
