llama.cpp
examples : add llama-eval
#21152
Merged

ggerganov merged 66 commits into master from gg/scripts-eval
ggerganov
github-actions added the examples label
github-actions added the python label
strawberrymelonpanda
ggerganov
strawberrymelonpanda
strawberrymelonpanda
gatbontonpc working llama-eval mc and math suite
db8b09d6
gatbontonpc multi source llama-eval
4db4497c
gatbontonpc Add readme
c7f3ce25
gatbontonpc add checkpointing
5cbe95b6
ggerganov examples: add llama-server simulator for testing eval scripts
58bd57ba
ggerganov examples: refactor test-simulator.sh for better readability
05b8425b
ggerganov docs: update llama-eval-discussion.md with session work summary
deed0786
ggerganov examples: add simplified llama-eval-new.py for AIME evaluation
a2b96e04
ggerganov docs: remove README.md from llama-eval
de8eda46
ggerganov examples: implement flexible grader system for answer validation
0ca458d8
ggerganov examples: use HF_HUB_OFFLINE to avoid HF Hub warnings
30ea5124
ggerganov examples: remove HF_HUB_OFFLINE to allow dataset download
d7d2c229
ggerganov examples: use cached dataset path to avoid HF Hub requests
edc766c9
ggerganov examples: use cached dataset path in simulator to avoid HF Hub requests
3732aea2
ggerganov docs: update llama-eval-discussion.md with session work summary
2fe445cc
ggerganov examples: add threading support and model parameter to llama-eval-new.py
fb40d1a0
ggerganov docs: update llama-eval-discussion.md with threading and model parame…
d639ee52
ggerganov examples: add task summary table to llama-eval-new.py
ee9b715e
ggerganov eval : print progress
940364e4
ggerganov eval : add prompts
1a780f7c
ggerganov test : fix path
64720e1e
ggerganov sim : fix answer matching
cda8cae0
ggerganov eval : support multiple dataset runs
530f38f9
ggerganov minor
9578e83a
ggerganov improve grader
4f176f6a
ggerganov docs
65e3c5a9
ggerganov remove old files
abec77e0
ggerganov datasets : add gsm8k
55ce1b4e
ggerganov add gpqa + sampling + docs
e7b86460
ggerganov rename
9f02fa63
ggerganov grader : improve example answers
6e7e1a5a
ggerganov cont
55a7cf4a
ggerganov datasets : add aime2025
f99d77f3
ggerganov grader : update prompt
8b94ab4f
ggerganov grade : improve regex + logs
122dfe3e
ggerganov datasets : fix aime2025
f20b5a72
ggerganov cleanup
91bd92c6
ggerganov add AGENTS.md
802d85e2
ggerganov ignore errors
f35b10f0
ggerganov resume eval
d830acac
ggerganov cleanup
095c8ab6
ggerganov fix counts
f95f4dd1
ggerganov simplify
2e0b6766
ggerganov fix prompts
7e8c88c5
ggerganov add html
36497938
ggerganov store full response
6797d80d
ggerganov add tokens
fc571f3a
ggerganov reasoning and error handling
752b703a
ggerganov refactor
bad9565a
ggerganov track total time
e0a2cf48
ggerganov remove junk
633a68d6
ggerganov eval : unify "judge" terminology to "grader"
7d433f76
ggerganov eval : add Wilson score confidence interval to results
81a65cf0
ggerganov force-pushed from 1c128d94 to 81a65cf0 17 days ago
ggerganov llama-eval : add per-task generation speed from server timings
4d5dedc5
ggerganov llama-eval : add per-task generation time from server timings
9f10d8d1
ggerganov llama-eval : rename display, escaped, and count variables to use pref…
d26b1ffc
ggerganov llama-eval : support multiple evaluation endpoints with dynamic task …
43f14a0a
ggerganov llama-server-simulator : replace Flask with stdlib http.server
f64d56bc
ggerganov llama-eval : update README with PR link and quick-start examples
094554db
ggerganov marked this pull request as ready for review 16 days ago
ggerganov requested a review from copilot-pull-request-reviewer 16 days ago
copilot-pull-request-reviewer commented on 2026-05-10
ggerganov llama-eval : track model name in eval state and verify on resume
e5ac6d1d
ggerganov llama-server-simulator : fix comment - Dice coefficient, not Levenshtein
85c6aa00
ggerganov llama-eval : require --grader-model or --model when using --grader-ty…
d5165e8f
ggerganov llama-eval : protect dump() with lock for thread safety
f49c636d
ggerganov llama-eval : compact HTML report output
eda7b07d
ggerganov llama-eval : check server connectivity on startup
56465e96
ggerganov llama-eval : use server1/server2 instead of gpu1/gpu2 in README
f634472a
ggerganov merged fde69a36 into master 15 days ago
cmp-nct
ggerganov deleted the gg/scripts-eval branch 15 days ago
JohannesGaessler
ggerganov
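
Commit 81a65cf0 above adds a Wilson score confidence interval to the eval results. This page does not show the implementation, so the following is only a minimal sketch of the standard Wilson formula for a binomial proportion. The function name `wilson_interval`, its parameters, and the default z = 1.96 (about 95% confidence) are assumptions for illustration, not taken from the PR.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple:
    """Return (low, high) Wilson score bounds for the accuracy correct/total.

    Hypothetical helper: the name, signature, and default z are assumptions,
    not the code merged in the PR.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")

    p = correct / total                  # observed accuracy
    z2 = z * z
    denom = 1.0 + z2 / total             # shared normalising factor
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom

    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    low, high = wilson_interval(50, 100)
    print(f"accuracy 0.500, interval [{low:.3f}, {high:.3f}]")
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and remains usable when every task, or no task, is answered correctly, which matters for small task sets such as AIME.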
