examples : add llama-eval #21152
working llama-eval mc and math suite
db8b09d6
multi source llama-eval
4db4497c
Add readme
c7f3ce25
add checkpointing
5cbe95b6
examples: add llama-server simulator for testing eval scripts
58bd57ba
examples: refactor test-simulator.sh for better readability
05b8425b
docs: update llama-eval-discussion.md with session work summary
deed0786
examples: add simplified llama-eval-new.py for AIME evaluation
a2b96e04
docs: remove README.md from llama-eval
de8eda46
examples: implement flexible grader system for answer validation
0ca458d8
examples: use HF_HUB_OFFLINE to avoid HF Hub warnings
30ea5124
examples: remove HF_HUB_OFFLINE to allow dataset download
d7d2c229
examples: use cached dataset path to avoid HF Hub requests
edc766c9
examples: use cached dataset path in simulator to avoid HF Hub requests
3732aea2
docs: update llama-eval-discussion.md with session work summary
2fe445cc
examples: add threading support and model parameter to llama-eval-new.py
fb40d1a0
docs: update llama-eval-discussion.md with threading and model parame…
d639ee52
examples: add task summary table to llama-eval-new.py
ee9b715e
eval : print progress
940364e4
eval : add prompts
1a780f7c
test : fix path
64720e1e
sim : fix answer matching
cda8cae0
eval : support multiple dataset runs
530f38f9
minor
9578e83a
improve grader
4f176f6a
docs
65e3c5a9
remove old files
abec77e0
datasets : add gsm8k
55ce1b4e
add gpqa + sampling + docs
e7b86460
rename
9f02fa63
grader : improve example answers
6e7e1a5a
cont
55a7cf4a
datasets : add aime2025
f99d77f3
grader : update prompt
8b94ab4f
grade : improve regex + logs
122dfe3e
datasets : fix aime2025
f20b5a72
cleanup
91bd92c6
add AGENTS.md
802d85e2
ignore errors
f35b10f0
resume eval
d830acac
cleanup
095c8ab6
fix counts
f95f4dd1
simplify
2e0b6766
fix prompts
7e8c88c5
add html
36497938
store full response
6797d80d
add tokens
fc571f3a
resoning and error handling
752b703a
refactor
bad9565a
track total time
e0a2cf48
remove junk
633a68d6
eval : unify "judge" terminology to "grader"
7d433f76
eval : add Wilson score confidence interval to results
81a65cf0
ggerganov force pushed from 1c128d94 to 81a65cf0 17 days ago
llama-eval : add per-task generation speed from server timings
4d5dedc5
llama-eval : add per-task generation time from server timings
9f10d8d1
llama-eval : rename display, escaped, and count variables to use pref…
d26b1ffc
llama-eval : support multiple evaluation endpoints with dynamic task …
43f14a0a
llama-server-simulator : replace Flask with stdlib http.server
f64d56bc
llama-eval : update README with PR link and quick-start examples
094554db
ggerganov marked this pull request as ready for review 16 days ago
llama-eval : track model name in eval state and verify on resume
e5ac6d1d
llama-server-simulator : fix comment - Dice coefficient, not Levenshtein
85c6aa00
llama-eval : require --grader-model or --model when using --grader-ty…
d5165e8f
llama-eval : protect dump() with lock for thread safety
f49c636d
llama-eval : compact HTML report output
eda7b07d
llama-eval : check server connectivity on startup
56465e96
llama-eval : use server1/server2 instead of gpu1/gpu2 in README
f634472a
ggerganov merged fde69a36 into master 15 days ago
ggerganov deleted the gg/scripts-eval branch 15 days ago
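
For readers unfamiliar with the statistic named in commit 81a65cf0 ("eval : add Wilson score confidence interval to results"), here is a minimal sketch of the standard Wilson score interval for a pass rate. It is not taken from the PR: the function name `wilson_interval`, its parameters, and the default `z = 1.96` (roughly a 95% interval) are assumptions made for illustration only.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (hypothetical helper).

    correct: number of tasks graded as correct
    total:   number of tasks evaluated
    z:       normal quantile; 1.96 is assumed here for ~95% coverage
    """
    if total <= 0:
        # No data: the proportion is unconstrained.
        return (0.0, 1.0)

    p = correct / total
    z2 = z * z
    denom = 1.0 + z2 / total
    # Interval center is pulled toward 0.5 for small samples.
    center = (p + z2 / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total + z2 / (4.0 * total * total)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


if __name__ == "__main__":
    lo, hi = wilson_interval(50, 100)
    print(f"50/100 correct: [{lo:.3f}, {hi:.3f}]")
```

Unlike the plain normal approximation, this interval stays inside [0, 1] and remains usable when the score is near 0% or 100%, which is common on small benchmark subsets.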