examples : add llama-eval (#21152)
* working llama-eval mc and math suite
* multi source llama-eval
* Add readme
* add checkpointing
* examples: add llama-server simulator for testing eval scripts
Add a standalone Python script that simulates a llama-server HTTP endpoint
for testing the eval script. The simulator:
- Implements /v1/chat/completions endpoint with OpenAI-compatible format
- Loads AIME dataset from HuggingFace with local caching
- Uses Levenshtein distance for intelligent question matching
- Supports configurable success rate for correct/wrong answer generation
- Provides debug logging for troubleshooting
Also includes test scripts and documentation for exercising and understanding
the simulator; a sample client request is sketched below.
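As a usage illustration only, a client request to the simulated endpoint might look like the following sketch; the port and payload fields are assumptions for illustration, not the simulator's exact defaults.

```python
import requests

# Assumed local address; adjust to whatever port the simulator binds.
SIMULATOR_URL = "http://127.0.0.1:8080/v1/chat/completions"

payload = {
    "model": "simulated-model",  # placeholder model name
    "messages": [
        {"role": "user", "content": "Find the remainder when 2^10 is divided by 7."}
    ],
}

resp = requests.post(SIMULATOR_URL, json=payload, timeout=30)
resp.raise_for_status()

# OpenAI-compatible responses carry the text under choices[0].message.content.
print(resp.json()["choices"][0]["message"]["content"])
```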
* examples: refactor test-simulator.sh for better readability
Extract the repeated question string into a TEST_QUESTION variable and
add a make_request() helper function to reduce code duplication.
Add proper handling of error responses.
* docs: update llama-eval-discussion.md with session work summary
Add summary of llama-server-simulator implementation work including
features, testing results, technical decisions, and refactoring.
* examples: add simplified llama-eval-new.py for AIME evaluation
- Create new simplified evaluation script focused only on AIME
- Implement EvalState and Processor dataclasses for structured state management
- Add real-time feedback showing correct/incorrect status per case
- Abstract grading interface for external grader support
- Use structured JSON output for eval state
- Apply HuggingFace dataset caching to avoid repeated downloads
- Remove Levenshtein matching; the eval script only sends requests and validates answers
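A rough illustration of the dataclass-based state described above; the field names here are guesses for the sketch, not the exact ones in llama-eval-new.py.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class Case:
    question: str
    expected: str
    predicted: str | None = None
    correct: bool | None = None

@dataclass
class EvalState:
    dataset: str
    cases: list[Case] = field(default_factory=list)

    def dump(self, path: str) -> None:
        # Structured JSON output so a run can be inspected or resumed later.
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=2)

state = EvalState(dataset="aime", cases=[Case("What is 1 + 1?", "2")])
state.dump("eval-state.json")
```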
* docs: remove README.md from llama-eval
* examples: implement flexible grader system for answer validation
- Add Grader class supporting regex and CLI-based grading
- Implement built-in regex patterns for AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande
- Add CLI grader interface: python script.py --answer <pred> --expected <gold>
- Disable HF telemetry to avoid warnings
- Support exact match requirement for regex patterns
- Add 30-second timeout for CLI grader
- Handle both boxed and plain text formats for AIME answers
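A boxed-answer regex check for AIME could look roughly like this sketch; the actual patterns in the Grader class may differ.

```python
import re

# Match either \boxed{204} or a bare integer at the end of the response.
BOXED = re.compile(r"\\boxed\{\s*(\d+)\s*\}")
PLAIN = re.compile(r"(\d+)\s*$")

def grade_aime(response: str, expected: str) -> bool:
    m = BOXED.search(response) or PLAIN.search(response)
    if m is None:
        return False
    # AIME answers are integers in 0-999, so an exact integer match suffices.
    return int(m.group(1)) == int(expected)

print(grade_aime(r"The answer is \boxed{204}", "204"))  # True
print(grade_aime("I think the answer is 205", "204"))   # False
```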
* examples: use HF_HUB_OFFLINE to avoid HF Hub warnings
* examples: remove HF_HUB_OFFLINE to allow dataset download
* examples: use cached dataset path to avoid HF Hub requests
* examples: use cached dataset path in simulator to avoid HF Hub requests
* docs: update llama-eval-discussion.md with session work summary
* examples: add threading support and model parameter to llama-eval-new.py
- Add ThreadPoolExecutor for parallel request processing controlled by --threads
- Add --model argument to specify model name in request data
- Refactor process() to use thread-safe _process_single_case() method
- Update progress tracking to work with concurrent execution
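Conceptually the threading change boils down to a thread pool mapping over cases with a lock around shared progress state; a minimal sketch, with the request/grading step stubbed out:

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

lock = threading.Lock()
results: list[tuple[int, bool]] = []

def _process_single_case(case_id: int) -> tuple[int, bool]:
    # Stand-in for sending one request to the server and grading the answer.
    return case_id, case_id % 2 == 0

cases = list(range(100))
with ThreadPoolExecutor(max_workers=8) as pool:  # --threads would set max_workers
    futures = [pool.submit(_process_single_case, c) for c in cases]
    for fut in as_completed(futures):
        case_id, ok = fut.result()
        with lock:  # keep shared progress tracking consistent under concurrency
            results.append((case_id, ok))
            print(f"completed {len(results)}/{len(cases)}")
```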
* docs: update llama-eval-discussion.md with threading and model parameter updates
- Add threading support implementation details
- Document ThreadPoolExecutor usage and thread safety
- Add model parameter implementation details
- Include testing results for both features
* examples: add task summary table to llama-eval-new.py
* eval : print progress
* eval : add prompts
* test : fix path
* sim : fix answer matching
* eval : support multiple dataset runs
* minor
* improve grader
* docs
* remove old files
* datasets : add gsm8k
* add gpqa + sampling + docs
* rename
* grader : improve example answers
* cont
* datasets : add aime2025
* grader : update prompt
* grader : improve regex + logs
* datasets : fix aime2025
* cleanup
* add AGENTS.md
* ignore errors
* resume eval
* cleanup
* fix counts
* simplify
* fix prompts
* add html
* store full response
* add tokens
* reasoning and error handling
* refactor
* track total time
* remove junk
* eval : unify "judge" terminology to "grader"
Replace all occurrences of "judge" with "grader" for consistency
across the codebase (CLI args, Grader class fields, help text).
Assisted-by: llama.cpp:local pi
* eval : add Wilson score confidence interval to results
Compute 95% CI on-the-fly from completed cases. Displayed in
terminal output, HTML report, and JSON state.
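For reference, the Wilson score interval for k correct out of n completed cases at 95% confidence (z ≈ 1.96) computes as in this sketch:

```python
import math

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    # Wilson score interval for a binomial proportion (k successes in n trials).
    if n == 0:
        return 0.0, 1.0
    p = k / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return max(0.0, center - margin), min(1.0, center + margin)

lo, hi = wilson_ci(18, 30)
print(f"accuracy {18/30:.1%}  95% CI [{lo:.1%}, {hi:.1%}]")
```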
* llama-eval : add per-task generation speed from server timings
Extract predicted_per_second from the server timings response and store
it as tps_gen per task. Display in console progress, print_all_tasks,
and HTML report.
Assisted-by: llama.cpp:local pi
* llama-eval : add per-task generation time from server timings
Extract predicted_ms from the server timings response and store it as
t_gen_ms per task. Display in seconds with one decimal digit in console
progress, print_all_tasks, and HTML report.
Assisted-by: llama.cpp:local pi
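Both values come from the timings object named in the two commits above (predicted_per_second and predicted_ms); extracting them amounts to roughly this, with the surrounding response handling simplified:

```python
def extract_timings(response_json: dict) -> tuple[float | None, float | None]:
    # llama-server attaches a "timings" object with generation statistics.
    timings = response_json.get("timings", {})
    tps_gen = timings.get("predicted_per_second")  # generation speed, tokens/s
    t_gen_ms = timings.get("predicted_ms")         # generation time, milliseconds
    return tps_gen, t_gen_ms

tps_gen, t_gen_ms = extract_timings(
    {"timings": {"predicted_per_second": 42.5, "predicted_ms": 1234.0}})
if tps_gen is not None and t_gen_ms is not None:
    print(f"{t_gen_ms / 1000:.1f}s @ {tps_gen:.1f} t/s")
```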
* llama-eval : rename display, escaped, and count variables to use prefix convention
- _display suffix → display_ prefix (answer, tokens, tps, t_gen)
- _escaped suffix → escaped_ prefix (response, prompt, reasoning)
- _count suffix → n_ prefix (correct, incorrect, pending)
Assisted-by: llama.cpp:local pi
* llama-eval : support multiple evaluation endpoints with dynamic task distribution
- Add ServerConfig dataclass (url, threads, name)
- Accept comma-separated --server, --threads, --server-name CLI args
- Dynamic shared-queue task distribution across servers (fast servers do more work)
- One ThreadPoolExecutor per server, workers pull from shared Queue
- Track which server processed each task (server_name in results)
- Thread-safe EvalState with threading.Lock for concurrent mutations
- Server column in HTML report and console output
- Backward compatible: single server works as before
Assisted-by: llama.cpp:local pi
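The dynamic distribution can be pictured as one shared queue that every server's worker threads pull from; a simplified sketch (the ServerConfig fields follow the commit message, the rest is illustrative):

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class ServerConfig:
    url: str
    threads: int
    name: str

def worker(server: ServerConfig, tasks: queue.Queue, results: list, lock: threading.Lock) -> None:
    while True:
        try:
            task_id = tasks.get_nowait()  # fast servers simply pull more tasks
        except queue.Empty:
            return
        with lock:  # shared state mutations are guarded the same way as EvalState
            results.append((task_id, server.name))

tasks: queue.Queue = queue.Queue()
for i in range(20):
    tasks.put(i)

servers = [ServerConfig("http://localhost:8080", 4, "server1"),
           ServerConfig("http://localhost:8081", 2, "server2")]
results, lock = [], threading.Lock()

pools = [ThreadPoolExecutor(max_workers=s.threads) for s in servers]  # one pool per server
futures = [pool.submit(worker, s, tasks, results, lock)
           for pool, s in zip(pools, servers) for _ in range(s.threads)]
for f in futures:
    f.result()
for pool in pools:
    pool.shutdown()
```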
* llama-server-simulator : replace Flask with stdlib http.server
- Use HTTPServer + BaseHTTPRequestHandler instead of Flask
- RequestHandler handles POST /v1/chat/completions
- Server runs in daemon thread with clean Ctrl+C shutdown
- Remove flask and unused asdict imports
Assisted-by: llama.cpp:local pi
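A bare-bones version of the stdlib handler described above might look like this; the fixed reply body is a placeholder, not the simulator's actual answer logic, and the daemon-thread wiring is omitted.

```python
import json
from http.server import HTTPServer, BaseHTTPRequestHandler

class RequestHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/v1/chat/completions":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        _request = json.loads(self.rfile.read(length) or b"{}")  # parsed but unused in this sketch
        # Reply in the OpenAI-compatible shape that the eval script expects.
        reply = {"choices": [{"message": {"role": "assistant", "content": "42"}}]}
        body = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), RequestHandler).serve_forever()
```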
* llama-eval : update README with PR link and quick-start examples
Assisted-by: llama.cpp:local pi
* llama-eval : track model name in eval state and verify on resume
- Store model_name in EvalState and JSON output
- Display model in HTML summary table
- Verify --model matches stored model when resuming
Assisted-by: llama.cpp:local pi
* llama-server-simulator : fix comment - Dice coefficient, not Levenshtein
Assisted-by: llama.cpp:local pi
* llama-eval : require --grader-model or --model when using --grader-type llm
Assisted-by: llama.cpp:local pi
* llama-eval : protect dump() with lock for thread safety
Assisted-by: llama.cpp:local pi
* llama-eval : compact HTML report output
- Replace verbose summary table with single inline bar
- Shorten status text: '✓'/'✗'/'–'/'!' instead of full words
- Flatten CSS: remove box-shadows, border-radius, reduce padding
- Use system-ui font, 13px table, 12px details
- Conditional reasoning section (only shown when present)
- Single toggle JS function instead of two
- Shorter column headers
Assisted-by: llama.cpp:local pi
* llama-eval : check server connectivity on startup
- Hit /v1/models for each server before evaluation
- Exit with error if any server is unreachable
- Print comma-separated model IDs per server in startup output
- Sequential checks, no retries, no timeout override
Assisted-by: llama.cpp:local pi
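The startup check boils down to a GET against each server's /v1/models endpoint, roughly as sketched here (the error message wording is illustrative):

```python
import sys
import requests

def check_servers(urls: list[str]) -> None:
    for url in urls:  # sequential checks, no retries
        try:
            resp = requests.get(f"{url}/v1/models")
            resp.raise_for_status()
        except requests.RequestException as e:
            sys.exit(f"error: server {url} is unreachable: {e}")
        # The OpenAI-compatible /v1/models reply is {"data": [{"id": ...}, ...]}.
        ids = ", ".join(m["id"] for m in resp.json().get("data", []))
        print(f"{url}: {ids}")

check_servers(["http://localhost:8080"])
```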
* llama-eval : use server1/server2 instead of gpu1/gpu2 in README
Assisted-by: llama.cpp:local pi
---------
Co-authored-by: gatbontonpc <gatbontonpc@gmail.com>